In this Issue

Early last year, Hewlett-Packard introduced a family of new workstation com- 
puters that surprised the workstation world with their high performance — a 
huge increase over the previous industry leaders — and their low prices. On 
standard industry benchmarks, the HP Apollo 9000 Series 700 computers outdis- 
tanced the competition by a wide margin. The speed of the Series 700 machines 
can be attributed to a combination of three factors. One is a new version of HP's 
PA-RISC architecture called PA-RISC 1.1. (The PA stands for precision
architecture and the RISC stands for reduced instruction set computing.)
PA-RISC 1.1 was worked on by teams from HP and the former Apollo Computers,
Incorporated, then newly acquired by HP. It includes several enhancements
specifically aimed at improving workstation performance. The second factor in
the new computers' speed is a new set of very large-scale integrated circuit
chips capable of operating at clock rates up to 66 megahertz. Called PCX-S,
the chipset includes a 577,000-transistor CPU (central processing unit), a
640,000-transistor floating-point coprocessor, and a 185,000-transistor memory
and system bus controller. The third factor is a new version of the HP-UX
operating system that takes advantage of the architectural enhancements of
PA-RISC 1.1 and offers additional compiler optimizations to make programs run
faster.

The Series 700 hardware design story will appear in our next issue (August). In
this issue we present the software part of the Series 700 speed formula. The
article on page 6 summarizes the architectural enhancements of PA-RISC 1.1 and
tells how the kernel of the HP-UX operating system was modified to take
advantage of them. The article on page 11 describes the development process for
the kernel modifications, which was tuned to meet an aggressive schedule without
compromising quality. This article includes a brief description of the overall
management structure for the Series 700 development project, which is now
considered within HP to be a model for future short-time-to-market projects. An
overview of the additional compiler optimizations included in the new HP-UX
release is provided by the article on page 15, along with performance data
showing how the compiler enhancements improve the benchmark performance of the
Series 700 workstations. A new optimizing preprocessor for the FORTRAN compiler
that improves performance by 30% is described in the article on page 24.
Optimization techniques called register reassociation and software pipelining,
which help make program loops execute faster, are offered by the new compiler
versions and are described in the articles on pages 33 and 39, respectively. The
new release of the HP-UX operating system is the first to offer shared
libraries, which significantly reduce the use of disk space and allow the
operating system to make better use of memory. The HP-UX implementation of
shared libraries is described in the article on page 46.

The three research reports in this issue are based on presentations given at the
1991 HP Technical Women's Conference. The first paper (page 54) discusses the
integration of an electronic dictionary into HP-NL, HP's natural language
understanding system, which was under development at HP Laboratories from 1982
to 1991. Dictionaries are important components of most computational linguistic
products, such as machine translation systems, natural language understanding
systems, grammar checkers, spelling checkers, and word analyzers. Electronic
dictionaries began as word lists and have been evolving, becoming more complex
and flexible in response to the needs of linguistic applications. While the
electronic dictionary integrated into HP-NL was one of the more advanced and
greatly increased the system's capabilities, the integration was not without
problems, which the researchers feel should help guide the potential
applications of electronic dictionaries. The paper concludes with a survey of
applications that can use electronic dictionaries today or in the future.

The paper on page 68 presents the results of research on automated laser printer
print quality measurement using spatial frequency methods. Printers using
different print algorithms, dot sizes, stroke widths, resolutions, enhancement
techniques, and toners were given a test pattern to print consisting of
concentric circles spaced progressively closer (higher spatial frequency) with
increasing radius. The printed test patterns were analyzed by optical methods
and measures of relative print quality were computed. These machine evaluations
were then compared with the judgments of fourteen trained human observers shown
printed samples from the same printers. In all cases, the human jury agreed with
the machine evaluations. The method is capable of showing whether printer
changes can be expected to improve text, graphics, neither, or both.

Computer graphics rendering is the synthesis of an image on a screen from a
mathematical model contained in a computer. Photorealistic renderings, which are
produced using global illumination models, are the most accurate, but they are
computation-intensive, requiring minutes for simple models and hours for complex
subjects. The paper on page 76 presents the results of simulations of an
experimental parallel processor architecture for photorealistic rendering using
the raytracing rendering technique. The results so far indicate that four
processors operating in parallel can speed up the rendering process by a factor
of three. Work continues at HP Laboratories to develop actual hardware to test
this architectural concept.

R.P. Dolan 
Editor 



Cover 

The cover shows an artist's rendition of the transformations that take place when source code goes through 
register reassociation and software pipelining compiler optimizations. The multiple-loop flowchart represents 
the original source code, the smaller flowchart represents the optimization performed on the innermost loop by 
register reassociation, and the diagram in the foreground represents software pipelining. 



What's Ahead 

The August issue will present the hardware design of the HP Apollo 9000 Series 700 workstation computers. 
Also featured will be the design and manufacturing of the new color print cartridge for the HP DeskJet 500C 
and DeskWriter C printers, and the driver design for the DeskWriter C. There will also be an article on the HP
MRP Action Manager, which provides an interactive user interface for the HP MM materials management
software.

HP-UX Operating System Kernel
Support for the HP 9000 Series 700 
Workstations 



Because much of the Series 700 hardware design was influenced by the
system's software architecture, engineers working on the kernel code
were able to make changes to the kernel that significantly improved
overall system performance.

by Karen Kerschen and Jeffrey R. Glasson 



When the HP 9000 Series 700 computers were introduced,
we in the engineering and learning products organization
in the HP-UX kernel laboratory had a chance to see how
our year-long project stacked up against the competition.
In a video, we watched a Model 720 workstation pitted
against one of our comparably priced competitors' systems.
Both systems were running Unigraphics, which is a
suite of compute-intensive mechanical modeling programs
developed by McDonnell Douglas Corp. The two computers
converted images of a General Motors Corvette ZR1
from two to three dimensions, rotated the drawings,
contoured the surfaces, and recreated a four-view layout.
The Model 720, the lowest-priced of our new systems,
performed over eight times faster than the competition.

The Series 700 is based on the first processor to implement
the PA-RISC 1.1 architecture, which includes enhancements
designed specifically for the technical needs
of the workstation market. This was a departure from the
previous HP processor design, which served general
computation needs.

The new system SPU (system processing unit) features
three new chips: an integer processor, a floating-point
coprocessor, and a memory and system bus controller. In
addition, the Series 700 was developed to provide I/O
expandability through the Extended Industry Standard
Architecture (EISA) bus. For the software project teams,
this new hardware functionality raised some basic questions,
such as "What can the user do with these hardware
capabilities?" and "What can we do to take advantage of
the hardware features?" The answer to the first question
was fairly obvious because we knew that key users
would be engineers running CAE application programs
such as compute-intensive graphics for modeling mechanical
engineering designs. We also realized that the Series
700 systems were not intended as specialized systems,
but were aimed at a broad spectrum of high-performance
workstation applications, and they had to be fast everywhere,
without making trade-offs to computer-aided
design. Thus, addressing the second question gave
direction to the year-long software development effort.



The engineering challenges faced by our kernel development
teams were to identify the new features of the
hardware that could be exploited by the operating system,
and then to add or alter the kernel code to take
advantage of these features. By studying the hardware
innovations, the software team identified four areas for
kernel modification: CPU-related changes, floating-point
extensions, TLB (translation lookaside buffer) miss
routines, and I/O and memory controller changes. Underlying
the entire effort was an essential factor: performance.
To succeed in the marketplace, the Series 700 had
to have very fast response time and throughput.

The Series 700 performance accomplishments were
achieved by a working partnership between hardware and
software engineers. Both realized that an integrated
system approach was key to making the Series 700 a
high-performance machine. New hardware components
were engineered to ensure a balanced system, which
meant that I/O performance matched CPU performance.
Software architecture was considered in designing the
hardware, and much of the hardware suggested opportunities
for streamlining throughput and response time
through changes in the kernel code.

The hardware architecture of the Series 700 is shown in
Fig. 1. Each of these components is architected to ensure
that the software runs faster. The rest of this article
describes the changes to the kernel code to take advantage
of the Series 700 hardware features. The management
structure and development process are described in
the article on page 11.

CPU Related Changes to Kernel Code 

From a hardware perspective, the CPU chip performs all
processor functions (except floating-point) including
integer arithmetic (except multiplication), branch processing,
interrupt processing, data and instruction cache
control, and data and instruction memory management.
Additional interrupt processing and cache flush instructions
were added to the hardware, along with cache hints
to prefetch cache lines from memory (see the article on
page 15 for more information about cache hints).

Fig. 1. Block diagram of the HP 9000 Series 700 hardware (TLBs
are translation lookaside buffers).

The software release for the Series 700 operating system
was designed to address key features of the CPU chip.
Tailoring the kernel to the CPU's capabilities required the
following changes:

• Emulation of floating-point instructions, which also supports
the floating-point coprocessor enhancements

• Cache flush instructions to the I/O and memory controller
for the benefit of graphics applications

• Shadow registers for improved TLB (translation lookaside
buffer)* miss handling

• 4K-byte page size to reduce TLB miss rate

• Sparse PDIR (page directory), which reduces overhead
for the EISA I/O address space and is faster

• New block TLB entries to map the kernel and graphics
frame buffers.

Emulation of Floating-Point Instructions 

Although all Series 700 systems have floating-point
hardware, kernel instructions can now emulate all the
new floating-point instructions in software. This redundancy
was designed into the software to deal with floating-point
exceptions. PA-RISC 1.1 was defined to allow
hardware designers the freedom to implement what they
wanted efficiently, while still providing a consistent view
of the system to software. If someone executes an
instruction, the system doesn't care whether it was done
in hardware or software; the result is functionally identical
although performance differs. The computation
proceeds much more slowly in software than in hardware,
but this solution allows a machine without a
floating-point coprocessor to execute the
floating-point instructions and remain binary compatible.

The software implementation capability also provides
certain classes of operations that the hardware cannot
execute. For example, the Series 700 floating-point
coprocessor cannot multiply and divide denormalized
numbers.** When it encounters denormalized numbers,
the hardware generates an assist trap to signal the
operating system to emulate the required instruction.
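
The fallback described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the kernel's code: the function names are hypothetical, and the "hardware" and "software" paths are both ordinary Python multiplication, so only the routing decision is modeled. A denormalized double is taken to be a nonzero value smaller in magnitude than the smallest normal double.

```python
import sys


def is_denormalized(x):
    """True for a nonzero double below the smallest normal magnitude."""
    return x != 0.0 and abs(x) < sys.float_info.min


def emulate_multiply(a, b):
    """Stand-in for the kernel's software emulation (hypothetical name)."""
    return a * b


def multiply(a, b):
    """Return (product, path), where path says which side did the work.

    Assumed model: the coprocessor refuses denormalized operands and
    raises an assist trap, which the kernel services by emulation.
    """
    if is_denormalized(a) or is_denormalized(b):
        return emulate_multiply(a, b), "software"
    return a * b, "hardware"
```

Either path returns the same numerical result; only the cost differs, which is the point the article makes about functional equivalence.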

Software engineers modified the kernel to accommodate
the expanded floating-point register file and to make
these registers accessible as destinations. The additional
registers allow more floating-point data to be accessed
quickly, which reduces the system's need to access
memory in floating-point-intensive applications.

Cache and Cache Flush Instructions 

The Series 700 system has separate instruction and data
caches (see Fig. 1). This design allows better pipelining
of instructions that reference data by giving two ports to
the CPU's ALU (arithmetic logic unit). This amounts to a
degree of parallel processing in the CPU. To maximize
this parallel processing, both cache arrays interface
directly to the CPU and the floating-point coprocessor.

The data path from the CPU to the data caches was
widened from 32 to 64 bits. This allows two words to be
transferred in one cycle between memory and registers.
The operating system exploits the 64-bit-wide data path to
allow higher throughput between the CPU and memory.
The operating system also takes advantage of the widened
data path when using floating-point double-word
LOADs, STOREs, and quad-word STOREs to COPY and ZERO
data in the kernel.

New cache flush instructions have been added to access
special dedicated hardware in the memory and system
bus controller (discussed in detail later in this article).
This hardware does direct memory access (DMA) block
moves to and from memory without involving the CPU. It
also handles color interpolation and hidden surface
removal. These features benefit graphics applications,
which use the enhanced cache flush instructions to
access data more efficiently.

Shadow Registers 

Another CPU feature is the addition of shadow registers.
Shadow registers are extensions of the processor that
reduce the number of instructions needed to process
certain interrupts, particularly TLB misses. The new
PA-RISC processor shadows seven general registers.
Without shadow registers, when the processor receives an
interrupt the operating system must save (reserve) some
registers before they can be used to service the interrupt.
This is because the operating system has no idea how the
general registers are being used at the time of the interrupt.
(A user program might be running or executing a
system call in the kernel.) Shadow registers eliminate the
need for the operating system to store registers before
they are used in the interrupt handler. The CPU automatically
stores the shadowed registers when the interrupt
occurs and before the processor jumps to the interrupt
handler. This shortens the interrupt handlers by several
instructions. Shadow registers are used for TLB interrupts,
which are the most time-critical interrupts.

* A translation lookaside buffer or TLB is a hardware address translation table. The TLB
and cache memory typically provide an interface to the memory system for PA-RISC
processors. The TLB speeds up virtual-to-real address translations by acting as a cache for
recent translations. More detailed information about the TLB can be found in reference 1.

** In the IEEE 754 floating-point standard, a denormalized number is a nonzero floating-point
number whose exponent has a reserved value, usually the format's minimum, and whose
explicit or implicit leading significand bit is zero.

The CPU has another new instruction, RFIR (return from
interrupt and restore), to return from an interrupt and
restore (copy) the shadow registers back to the general
registers. RFIR exists along with RFI (return from interrupt),
which doesn't copy the shadow registers. RFIR has
specific and limited applicability to TLB interrupts because
interrupts using the shadow registers cannot be
nested. Most of the time, the operating system still
uses RFI.

4K-Byte Page Size 

To further improve memory access, the page size was
increased from 2K bytes to 4K bytes. This reduces the
number of TLB misses a typical application will encounter.
In the software, changes were made in the low levels
of the operating system's virtual memory subsystem. A lot
of work was done so that the PA-RISC 1.0 systems, which
have a 2K-byte page size, can have a logical 4K-byte page
size.

Sparse Page Directory 

If we had used the old page directory (PDIR) architecture
that maps virtual to physical pages of memory, gaps in
the EISA address space would have wasted a significant
amount of physical memory to store unused PDIR entries.
Therefore, it was decided to redefine the page directory
from an array to a linked list. Now, instead of taking the
virtual address and calculating an offset in the table, a
hash function produces a pointer to a page directory
entry (PDE) that corresponds to the physical address. In
most cases, the hashing algorithm produces a direct
mapping to the point in the table. In some cases, such as
a hash collision, the first PDE on the list has to link to
another PDE as shown in Fig. 2.

If the translation does not exist in the PDIR, a PDE is
taken off the PDE free list and inserted into the correct
hash chain. The sparse PDIR reduces the amount of
memory needed to store the page tables.
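
A minimal Python sketch of such a hashed, chained page directory follows. It is an illustration under stated assumptions, not the HP-UX data structure: the class and method names are hypothetical, the hash is a simple modulo chosen only to make collisions easy to see, and the bucket and free-list sizes are arbitrary.

```python
class PDE:
    """One page directory entry: a virtual-to-physical page translation."""

    def __init__(self):
        self.vpage = None
        self.ppage = None
        self.next = None  # link to the next PDE on the same hash chain


class SparsePDIR:
    def __init__(self, buckets=8, free_entries=16):
        self.buckets = [None] * buckets
        self.free = [PDE() for _ in range(free_entries)]  # PDE free list

    def _hash(self, vpage):
        return vpage % len(self.buckets)  # placeholder hash (assumption)

    def lookup(self, vpage):
        """Walk the hash chain; return the physical page or None."""
        pde = self.buckets[self._hash(vpage)]
        while pde is not None:
            if pde.vpage == vpage:
                return pde.ppage
            pde = pde.next
        return None

    def insert(self, vpage, ppage):
        """Take a PDE off the free list and link it into its hash chain."""
        if not self.free:
            raise MemoryError("PDE free list is empty")
        pde = self.free.pop()
        pde.vpage, pde.ppage, pde.next = vpage, ppage, None
        slot = self._hash(vpage)
        head = self.buckets[slot]
        if head is None:
            self.buckets[slot] = pde  # direct mapping, no collision
            return
        while head.next is not None:  # collision: append to the chain
            head = head.next
        head.next = pde
```

Because entries exist only for pages that are actually mapped, holes in the address space cost nothing, which is the memory saving the text attributes to the sparse PDIR.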



TLB Miss Improvements 

The TLB, which is on the processor chip, consists of two
96-entry fully associative TLBs, one for instructions and
one for data. Each of these TLBs has block TLB entries,
four each for instructions and data. Each fully associative
entry maps only a single page, while each block entry is
capable of mapping large contiguous ranges of memory,
from 128K bytes to 16M bytes. These block entries help
reduce TLB misses by permanently mapping large portions
of the operating system and graphics frame buffer.

Block entries reduce the number of total TLB entries
used by the operating system. Block entry mapping leaves
more general TLB entries for user programs and data,
thus reducing the frequency of TLB misses and improving
overall system performance. We map most of the kernel's
text space and a good portion of the kernel's data using
block TLB entries.

TLB misses are handled differently by the Series 700
processor than in earlier processor implementations. Miss
handler code is invoked when the TLB miss interrupt is
generated by the processor. The processor saves some
registers in its shadow registers and transfers control to
the software TLB miss handler. The miss handler hashes
into the sparse PDIR in memory to find a virtual-to-physical
translation of the address that caused the interrupt. If it
finds it, the translation is installed in the TLB and the
transaction is retried. If it doesn't find it, page fault code
is executed. (In another case, protection identifiers, which
govern access rights, might prevent translation, that is,
the address might exist but the process trying to access
the data might not have access rights to the data.)
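
The miss-handling flow just described can be modeled roughly as follows. This is a simplified sketch, not the actual handler: the TLB and the page directory are plain Python dictionaries keyed by virtual page number, all names are hypothetical, and shadow registers, block entries, and protection identifiers are left out.

```python
PAGE_SIZE = 4096  # the 4K-byte page size described earlier


class PageFault(Exception):
    """Raised when the page directory holds no translation."""


def handle_tlb_miss(vpage, tlb, pdir):
    """Software miss handler: look up the page directory, install in the TLB."""
    ppage = pdir.get(vpage)
    if ppage is None:
        raise PageFault(vpage)  # hand over to page fault code
    tlb[vpage] = ppage


def translate(vaddr, tlb, pdir):
    """Translate a virtual address, taking a TLB miss if necessary."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    if vpage not in tlb:  # the hardware would raise a TLB miss interrupt
        handle_tlb_miss(vpage, tlb, pdir)
    return tlb[vpage] * PAGE_SIZE + offset  # the retried access now hits
```

A second access to the same page finds the translation already in `tlb` and never enters the handler, which is why fewer misses translate directly into better performance.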

Floating-Point Coprocessor Extensions 

The PA-RISC 1.1 architecture features a floating-point
coprocessor with an extended floating-point instruction
set that has 32 double-precision registers (previously,
there were 16). These registers are also accessible as 64
single-precision registers (compared to 16 single-precision
registers in the previous implementation). The additional
floating-point registers add flexibility in terms of how
many registers a programmer can access. More data can
be held in registers that are quickly accessible by the
CPU.

Many new floating-point instructions were added to the
floating-point instruction set to accommodate more
demanding graphics applications and improve matrix
manipulations. From the software perspective, the following
areas of kernel code were changed:

• Save states

• Store instructions

• Extensions to FTEST

• Multiply instructions

• Floating-point exception handling.

Save States. When a user process gets a signal, the system
copies the contents of all its registers (general and
floating-point) to the user stack so that the user's signal
handler can access and modify them. However, to maintain
binary compatibility with older implementations and
applications, hooks were added to the kernel to identify
the size of the save-state data structure. Knowing the size
of this data structure ensures that the system will copy
the correct number of registers to the user stack (or copy
back from the user stack), so that programs compiled on
older (PA-RISC 1.0) hardware will run without having to
be recompiled on the new hardware.

Store Instructions. Quad-word store instructions in the
floating-point processor store four words (two double-word
registers) at once, and execute in fewer cycles than
two double-word store instructions. The kernel uses the
quad-word store instruction in copy and zero routines, if
it detects its presence. Quad store instruction code is not
portable to the PA-RISC 1.0 implementation or to other
1.1 systems.

FTEST Extensions. Extensions to FTEST streamline graphics
clip tests, which benefits two-dimensional and three-dimensional
graphics performance.

FTEST is an instruction used to check the status of subsets
of the floating-point compare queue. In previous implementations,
FTEST could test only the result of the last
FCMP (floating-point compare). The Series 700 extends
FCMP to keep the last twelve compare results in a queue,
using bits 10 through 20 of the floating-point status
register in addition to the C-bit (bit 5). FTEST now looks
at different pieces of the queue to determine whether to
nullify the next instruction. An example of how the FTEST
extension can save processor cycles is given on page 10.

FTEST extensions are not part of the PA-RISC 1.1 architecture,
but are specific to the Series 700. Therefore, any code
using the extensions is not portable to other PA-RISC
implementations.

Multiply Instructions. New multiple-operation instructions,
including multiply-and-add (FMPYADD), multiply-and-subtract
(FMPYSUB), and multiply-and-convert from floating-point
format to (fixed) integer format (FMPYCFXT), more
fully exploit the ALU and MPY computational units in the
floating-point coprocessor. This approach reduces the
number of cycles required to execute typical computational
combinations, such as multiplies and adds.



Also, an integer multiply instruction was added to the
instruction set to execute in the floating-point coprocessor.
Previously a millicode library routine was called to
do integer multiplies. This new implementation is much
faster.

Floating-Point Exception Handling. The floating-point coprocessor
computes results from the floating-point
instructions embedded in the hardware. This provides
the biggest performance boost to graphics,
particularly for transformations. However, certain circumstances
(such as operations on denormalized numbers)
cause the hardware to generate an exception, which
requires the kernel to emulate the instruction in software.
The emulation of the floating-point instruction set provides
much-needed backup and auxiliary computational
support.

Memory and System Bus Controller 

The memory and system bus controller, which was
implemented as a new chip, has two new features designed
specifically to streamline graphics functionality:

• The ability to read or write from the graphics frame buffer
(video memory) to main memory using direct memory
access (DMA) circuitry. DMA allows block moves of data
to and from the graphics card without having to go
through the CPU.

• The capability to do color interpolation (for light source
shading) and hidden surface removal.

Accommodating the new hardware system bus and
memory functionality required extensive changes to the
kernel code.

Besides the operating system changes, the HP Starbase
graphics drivers were rewritten to take advantage of
block moves and color interpolation. These drivers are
used by the X server to improve X Window System
performance and allow the use of lower-cost 3D graphics
hardware. This is because the drivers use the memory
and system bus controller to produce the graphical
effects, rather than relying on dedicated graphics hardware.

The memory and system bus controller features new
registers in its graphics portion. Kernel code was enhanced
to allow user processes to use these additional
registers. Any user graphics process can request the
kernel to map the memory and system bus controller
registers into its address space. A new ioctl system call
parameter provides access to the memory and system bus
controller registers. The new call maps the controller
register set into the user's address space to enable reads
and writes from those registers to instruct the memory
and system bus controller to perform graphics functions.
Once the user sets up the memory and system bus
controller registers, the transactions are initiated by
issuing a special flush data cache instruction.

Finally, new kernel code allows multiple processes to
share the memory and system bus controller. At context-switch
time, extra work happens if a process is using the
controller. The operating system saves and restores the
set of memory and system bus controller registers only if
the process has mapped them into its address space.

An Example of the FTEST Instruction

Bits 10 through 20 of the floating-point status register serve two purposes. They
are used to return the model and revision of the coprocessor following a COPR 0,0
instruction as defined by the PA-RISC architecture. Their other use is for a queue of
floating-point compare results.

Whenever a floating-point compare instruction (FCMP) is executed, the queue is
advanced as follows:

    FPstatus[11:20] = FPstatus[10:19]
    FPstatus[10] = FPstatus[5]
    FPstatus[5] = FCMP result (the C-bit)

The FTEST instruction has been extended to allow checking various combinations of
the compare queue. For example, to evaluate (fr4 == fr5) && (fr6 == fr7) && (fr8
== fr9) && (fr10 == fr11) would take 24 cycles to execute using the PA-RISC 1.0
instructions:

    FCMP,= fr4,fr5
    FTEST
    branch
    FCMP,= fr6,fr7
    FTEST
    branch
    FCMP,= fr8,fr9
    FTEST
    branch
    FCMP,= fr10,fr11
    FTEST
    branch

By comparison, using the Series 700 floating-point compare queue:

    FCMP,= fr4,fr5
    FCMP,= fr6,fr7
    FCMP,= fr8,fr9
    FCMP,= fr10,fr11
    FTEST,ACC4
    branch

takes only 12 cycles to execute.
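
The queue behavior in this example can be modeled with a short Python sketch. It is an illustration only: the class and function names are hypothetical, the queue is held as a list rather than as status register bits, and FTEST,ACC4 is assumed to mean "the four most recent compares were all true", which is how the example above uses it.

```python
class CompareQueue:
    """The twelve most recent FCMP results, newest first."""

    DEPTH = 12

    def __init__(self):
        self.results = [False] * self.DEPTH

    def fcmp(self, result):
        """Push a compare result; the oldest result falls off the end."""
        self.results = [bool(result)] + self.results[:-1]

    def accept(self, n):
        """Assumed FTEST,ACCn meaning: the n newest compares were all true."""
        return all(self.results[:n])


def instructions_without_queue(n_compares):
    """Each compare needs its own FCMP, FTEST, and branch."""
    return 3 * n_compares


def instructions_with_queue(n_compares):
    """All the FCMPs, then a single FTEST and a single branch."""
    return n_compares + 2
```

For the four-compare example this gives 12 instructions against 6, the same two-to-one ratio as the 24 and 12 cycles quoted above.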



Because of this, nongraphics processes are not penalized
by the operating system's saving and restoring of unused
registers.
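
The save-and-restore rule for the controller registers can be sketched as follows. This is a hypothetical model rather than kernel code: the class names, the register count of four, and the returned copy count are all assumptions made for illustration.

```python
class Controller:
    """Stand-in for the controller's graphics register set."""

    def __init__(self, nregs=4):  # register count is an assumption
        self.regs = [0] * nregs


class Process:
    def __init__(self, name, controller_mapped=False):
        self.name = name
        self.controller_mapped = controller_mapped  # set by the mapping call
        self.saved_regs = None


def context_switch(old, new, controller):
    """Switch processes; return how many register copies were needed."""
    copies = 0
    if old.controller_mapped:  # save only for processes that use it
        old.saved_regs = list(controller.regs)
        copies += len(controller.regs)
    if new.controller_mapped and new.saved_regs is not None:
        controller.regs = list(new.saved_regs)  # restore only when mapped
        copies += len(controller.regs)
    return copies
```

A switch between two processes that never mapped the controller costs no register copies at all, which is the property the paragraph above describes.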

Conclusion 

Noted author Alvin Toffler identified an underlying
challenge to today's computer industry in his book,
Powershift. "From now on," he wrote, "the world will be split
between the fast and the slow." Thanks to a successful
partnership of hardware innovation and kernel tuning, the
Series 700 can definitely be classified as one of the fast.

Providing HP-UX Kernel Functionality
on a New PA-RISC Architecture 



To ensure customer satisfaction and produce a high-performance, 
high-quality workstation on a very aggressive schedule, a special 
management structure, a minimum product feature set, and a modified 
development process were established. 

by Donald E. Bollinger, Frank P. Lemmon, and Dawn L. Yamine



The aggressive schedule for the development of the HP
9000 Series 700 systems required the development team
in the HP-UX kernel laboratory to consider some modifications
to the normal software development process, the
number of product features, and the management structure.
The goals for the product features were to change
or add the minimum number of HP-UX kernel functions
that would ensure customer satisfaction, meet our performance
goals, and adapt to a new I/O system. This version
of the HP-UX kernel code became known as minimum
core functionality, or MCF.

Series 700 Management Structure

To accomplish the goals set for the HP 9000 Series 700
system required a special management structure that
included a program manager with leadership and responsibility
for the whole program, and a process that allowed
rapid and sound decisions to be made. The resulting
management structure delegated the bulk of the management
to focused teams of individual developers and
first-level managers. The program manager owned every
facet of the release, from the software feature set to the
allocation of prototypes and production of chips in the
fabrication shop. Since the Series 700 program was
multidivisional and located in two geographical locations,
the program manager had to maintain a desk at
both locations, close to the hardware and software
development teams.

The rapid decision policy promoted by the management
team enabled small teams of individual developers and
first-level managers to make important program decisions
quickly and directly. Decision time itself was measured
and tracked around the program. For example, the system
team's goal was to have no open issues over two weeks
old. Also, the MCF kernel team tracked kernel defects on
a daily basis. If a defect aged over three days, additional
help was assigned immediately. The process to determine
the disposition of defects ran on a 24-hour clock. The
defect data was posted in the evening, votes were collected
by the team captain the next morning, the team
reviewed the votes and made decisions in the afternoon,
and approved fixes were incorporated into the build that
night.



A key decision made early in the program was whether
to base the kernel on HP-UX 7.0, which was stable and
shipping, or HP-UX 8.0, which was not yet shipping to
customers. HP-UX 8.0 offered the advantage of being the
basis for future releases, and thus the developers and
customers of the follow-on release to MCF could avoid
the overhead of first having to update to 8.0. This was a
critical decision. The R&D team promoted the advantages
of 8.0, while the program manager weighed the risks.
Within two weeks the program manager and the team
decided to base the operating system on HP-UX 8.0 and
the issue was never revisited.

Each team worked systemwide, with individual developers
focusing on a facet of the system. The performance team, with
members from hardware, kernel, languages, graphics, and
performance measurement groups, focused on the overall goal of
maximizing system performance in computation, graphics, and
I/O. The value added business (VAB) team focused on delivering
high-quality prototype hardware and software to key VAB
partners, allowing their software applications to release
simultaneously with the HP 9000 Model 720. There was also an
integration team, a release team, and a quality and testing
team.

The members of these teams were not merely representatives who
collected action items and returned them to their respective
organizations. The team members were the engineers and managers
involved in the development work. Thus, multidivisional
problems were solved right at the team meetings.

The overall program structure glued these teams together. Key
decisions were made by the program manager and other top-level
members of the management team. The system team managed the
tactical issues and the coordination of the focused teams. Most
people were members of multiple teams, providing crucial
linkage between individual team goals and organizational goals.
There was a rich, almost overwhelming flow of information. The
system team appended team reports and product status
information to their weekly minutes, which were distributed
widely so everyone saw the results.

The rest of this article will discuss the activities performed
in the software development process to create the MCF kernel.
Technical details of the MCF kernel can be found in the article
on page 6.

Quality Control Plan 

Historically, the reliability of HP-UX software has been
measured in terms of the following items:

• Defect density (the number of defects per one thousand
lines of noncomment source statements, or KNCSS)

• Functional test coverage (the number of external interfaces
tested and the branch flow coverage values)

• Reliability under stress (continuous hours of operation,
or CHO).

The MCF team added the following measures:

• Design and code reviews to ensure high software component
quality before delivery to system integration and test

• Weekly integration cycles with full testing participation
by the development partners, which included development
teams outside of the kernel laboratory.

The weekly integration cycles uncovered a number of 
interaction and system defects rapidly and early. 

The program team designated schedule and quality as the top two
priorities of the MCF release. Another program team decision
reduced the functionality to the minimum core requirements,
which in turn reduced the time to market. The program team also
chose to release only one system initially (the Model 720)
rather than three (the Models 720, 730, and 750) and to sell
the Model 720 in a stand-alone configuration only, rather than
supporting it in a diskless cluster.

These decisions resulted in reduced testing complexity. The
test setup times highlight the complexity reduction. For the
MCF release, the test setup time represented about 1% of the
total test time. For the subsequent releases (with Models 710,
720, 730, and 750 participating, in both stand-alone and
diskless configurations) the test setup time rose to about 12%.
The increase is significant since the test setup time cannot be
automated and represents valuable engineering time.

Certification Process 

The program team's decision to limit functionality guaranteed a
better, more stable system faster. There were fewer components
in which defects occurred and there were fewer interfaces where
interaction problems arose.

The system test team was able to capitalize on both these
benefits. A team decision was made to set a lower reliability
goal of 48 continuous hours of operation (CHO) under stress,
instead of the traditional 96 CHO. This decision substantially
reduced the number of system test cycles required. The system
test team next decided to attempt the 48 CHO reliability goal
in a single, four-week test cycle. Previous HP-UX releases had
required four test cycles, each ranging from two to six weeks.

The single-test-cycle model, a benefit of reduced
functionality, emphasized one of the key development goals:
"Do it right the first time." This goal was important, because
the aggressive MCF schedule did not permit the development
teams any time for rework.

In summary, the MCF quality plan featured the following
objectives:

• A reduction in configuration and testing complexity

• A single test cycle

• A 48-CHO software certification goal

• The use of design and code reviews before delivering new
functionality

• The use of traditional quality measurements before delivery
to system integration and test

• Weekly integration cycles with full partner testing
participation

• An early baseline established by the quality requirements
of the VAB team activities.

Design and Code Reviews 

The software engineers in the HP-UX kernel laboratory
determined that the best way to achieve the MCF quality
objectives was to focus on design and code reviews. Engineers
evaluated the effectiveness of their existing review process to
find defects before kernel integration and determined that it
was not adequate to meet the MCF quality goals. This led to a
search for a new design and code review process. Several of the
engineers had used a formal review process called software
inspection¹ on previous projects, and felt that it would find
key defects before kernel integration.

The inspection process was used during the design phase with
moderate success. A handful of the engineers had been
previously trained on the process. The rest of the engineers
simply received a document that described the inspection
process. There was no formal training given on inspection
roles, criteria, checklist, time requirements, or meeting
conduct.

When the inspection meetings began, several of the first-level
managers felt that the inspection process was not as successful
as it could be. They heard complaints from the engineers about
the design documents, insufficient preparation by the
inspectors, rambling meetings, and the absence of time
estimates in the MCF schedule to perform the process.

The managers put the inspection process on hold and asked an
inspection consultant about the complaints they had heard. The
consultant gave guidance about the team members' roles, how
inspectors should prepare for the meetings, what to focus on
during the meetings, and the amount of time required when the
process is operating properly.

The managers took this feedback back to the engineers so they
could make changes. For example, the time estimate to do the
inspections was added to the MCF schedule. This change showed
the engineers that they had the opportunity to do inspections,
and that the process was considered important. Performing
inspections also caused the MCF schedule to slip by two weeks.
The program team made an adjustment elsewhere in the program to
recover the two weeks.






)Copr. 1949-1998 Hewlett-Packard Co. 



Fig. 1. MCF inspection efficiency compared to other projects
inside and outside HP.



The main benefit of using inspections was that important
defects were found early. The advance defect visibility
minimized schedule delays by reducing the impact on critical
path activities. One defect uncovered during an inspection was
later estimated to require at least two weeks to isolate and
repair if it had been found during system integration.

Fig. 1 compares the MCF inspection efficiency with the results
of other projects inside and outside HP. MCF achieved the
second best results. Although data was not collected on the
number of defects found during product and system testing, the
general feeling was that there were fewer defects found in
comparison to other HP-UX releases. This feeling was confirmed
when the MCF kernel took six weeks to achieve the 48 continuous
hours of operation quality goal, compared to previous HP-UX
kernels, which had taken at least eight weeks.

Branch and Source Management 

The kernel sources had been managed by a source control system
that permitted multiple development branches to be open at any
time. This permitted different development efforts to proceed
independently. When the time came to merge branch development
into the main trunk, it was necessary to lock the branch.
Branch locks ranged on the order of a few days to two weeks,
depending on the number of changes and the stability of the
resulting kernel. The delays frustrated the engineers who were
waiting to include critical path functionality and important
defect fixes.

The basic MCF source management philosophy was: "Keep the
branch open!" Thus, locking the branch for two weeks was
unacceptable.

Two branches were required to implement the aggressive MCF
schedule: one to implement the new 4K-byte page size, and the
other to implement software support for a new I/O backplane.

Both branches began from the same snapshot of the HP-UX 8.0
kernel sources. As work progressed, a snapshot of the 4K-byte
page size team's work was merged with the I/O team's branch.
The merge was done in an ongoing, incremental fashion so that
no big surprises would appear late in the release and the
branch lock time would be minimized.

The merge was accomplished by running a software tool that
checked every line, in every file, on both branches. If a file
had no changes on either branch, the original file was kept. If
a file changed on one branch but not the other, the change was
incorporated. If a file changed on both branches, it was
flagged for an engineer to review and resolve manually.
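The three merge rules above can be sketched in a few lines of
Python. This is only an illustration, not the tool the MCF team
used: the function names and the representation of a source
tree as a dictionary from path to file contents are assumptions,
and the sketch decides per file rather than per line.

```python
def merge_file(base, branch_a, branch_b):
    """Three-way decision for one file, given its contents in the common
    snapshot (base) and on each branch. Returns (contents, needs_review)."""
    if branch_a == base and branch_b == base:
        return base, False          # unchanged on both: keep the original
    if branch_b == base:
        return branch_a, False      # changed only on branch A
    if branch_a == base:
        return branch_b, False      # changed only on branch B
    return None, True               # changed on both: flag for an engineer


def merge_trees(base, branch_a, branch_b):
    """Apply merge_file to every path; each tree maps path -> contents."""
    merged, flagged = {}, []
    for path in sorted(set(base) | set(branch_a) | set(branch_b)):
        contents, needs_review = merge_file(
            base.get(path), branch_a.get(path), branch_b.get(path))
        if needs_review:
            flagged.append(path)
        elif contents is not None:
            merged[path] = contents
    return merged, flagged
```

Only the flagged list needs an engineer's attention, which is
what keeps the branch lock short when most files change on one
branch only.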

The MCF merge goal was to lock the branch and require
engineering review for no more than 36 hours. The goal was
consistently met because of the careful attention of the kernel
branch administrator and the high degree of team cooperation
when assistance was required.

Automated Nightly Build and Test 

What new testing challenges did the MCF release present? The
key goal was to do a full kernel build and regression test
cycle five nights a week, not just once a week as had been done
in the past. Could we push the existing process this far? The
kernel integration team was uncertain, but was confident that
the minimum core functionality model could be capitalized on.

Regression testing revisits the software that has already been
tested by the development team. What did the kernel integration
team expect to gain from redundant testing? First, to observe,
characterize, and resolve any problems detected in the nightly
kernel build. Second, at least to match the test results from
the previous build, with the goal of converging to zero test
failures rapidly.

The MCF regression test plan featured the following:

• A test setup process that was bootstrapped

• Automated software that ran the regression tests five
nights a week

• An emphasis placed on parallel operation and the reliable
presence of test results

• Automated software that updated the test machines with the
system integration team's good system on a weekly basis.

The regression tests for kernel integration included the
following:

• File system tests: hierarchical, distributed, networked,
and CD-ROM

• Kernel functional tests

• Disk quota functional tests

• Database send and receive functional tests.

The software developers created their own tests to cover new
functionality (e.g., SCSI, Centronics, and digital tape
interfaces). These tests were run by the development teams
directly.

The Test Setup Process 

At first, test machines were scarce because there were only a
handful of hardware prototypes available to the MCF team.
Therefore, regression testing began on a standby basis.
Eventually, one hardware prototype became available for use on
weeknights. This allowed the test setup process to begin in
earnest.

Fig. 2. MCF redundancy testing.

The least-demanding tests, such as the distributed and
hierarchical file system tests, were installed and run first.
After any kernel or test problems were observed, characterized,
and resolved by analyzing the results, the setup for the more
difficult tests for areas such as the kernel, the network, and
the CD-ROM file system began.



Automatic Testing 



Software was developed to automate running the regression tests
nightly. The software was designed to perform the following
tasks:

• Put the kernel program and include files on the test
systems

• Reboot the test systems

• Start the tests

• Mail the test results directly to the kernel integration
team.

At one point, the kernel began to panic and the regression
tests were prematurely interrupted. This caused a problem in
receiving a complete set of test results. Fortunately, by this
time, two functional prototype machines were available for
nightly regression testing (see Fig. 2). The solution was to
have both machines run the regression tests each night, but in
reverse order. The first machine ran the easier file system
tests first, followed by the more demanding kernel functional
and remaining tests. The second system ran the same test
groups, but in reverse order. The "redundant but reverse order"
solution ensured the presence of a full set of test results
each morning by combining the output of both systems if
required.
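A small simulation shows why running the same groups in
opposite orders helps. It is a sketch under stated assumptions:
a panic is modeled as a run that stops before a given group,
and the function names and result strings are hypothetical
rather than taken from the MCF test software.

```python
def run_until_panic(test_groups, panic_at=None):
    """Simulate one machine: run groups in order and stop if the kernel
    panics when it reaches the group named panic_at."""
    results = {}
    for group in test_groups:
        if group == panic_at:
            break                    # run interrupted; later groups never start
        results[group] = "done"
    return results


def combine(forward_results, reverse_results):
    """Merge both machines' results; the forward machine wins on overlap."""
    combined = dict(reverse_results)
    combined.update(forward_results)
    return combined


groups = ["file system", "kernel", "disk quota", "database"]
forward = run_until_panic(groups, panic_at="disk quota")
reverse = run_until_panic(list(reversed(groups)), panic_at="kernel")
nightly = combine(forward, reverse)
```

Each machine alone reports only two of the four groups in this
example, while the combined report covers all of them.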

Once all the test groups were set up and running, it proved
impossible for the automated software to complete them within
the six-hour time limit. The problem was solved by modifying
the automated software to start as many of the test groups as
possible in parallel. The plan was to capitalize on the HP-UX
process scheduling abilities and maximize the throughput. One
assumption was made using this approach: the tests would not
adversely interact with each other. The assumption proved to be
true in general. The exceptions were the disk quota, CD-ROM,
and system accounting tests, which had conflicts. The automated
software was modified to serialize the execution of the disk
quota and CD-ROM test groups and run them as a separate stream
in parallel with the other test groups. The test administrator
chose to handle the system accounting test manually, which
continued to fail occasionally because of known conflicts.
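The scheduling idea can be sketched with Python's standard
thread pool: independent groups each get their own stream,
while groups known to conflict share one stream and so run one
after another. The names `run_stream`, `run_nightly`, and the
`run_group` callback are assumptions made for illustration; the
original software relied on HP-UX processes, not Python
threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stream(groups, run_group):
    """Run the given test groups one after another and collect results."""
    return [(name, run_group(name)) for name in groups]


def run_nightly(independent, conflicting, run_group):
    """Start each independent group as its own stream and the conflicting
    groups as a single serialized stream, all streams in parallel."""
    streams = [[name] for name in independent] + [list(conflicting)]
    results = {}
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        for stream_result in pool.map(
                lambda stream: run_stream(stream, run_group), streams):
            results.update(stream_result)
    return results
```

A call such as `run_nightly(["file system", "kernel"],
["disk quota", "CD-ROM"], run_group)` would run the first two
groups concurrently with a third stream that executes the disk
quota and CD-ROM groups in order.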

Weekly Delivery to System Integration 

A complete, internally consistent system was built every week,
allowing up-to-date software with the latest fixes to be used
by the development partners for system integration. To deliver
the new system to system integration, the kernel build
administrator examined the logs, handled any exceptional
conditions, communicated with partners, and then wrote, for
review by the management teams, a report that explained what
changed from week to week.

On Monday mornings, before the kernel build administrator had
arrived to check the logs, the kernel program and the include
files were automatically sent from Cupertino, California to the
HP-UX commands team in Fort Collins, Colorado. After delivery,
the HP-UX commands, which required the include files, began
automatically building the system. If the kernel build
administrator detected a problem in the error logs, the
commands build administrator was called. The two administrators
consulted over the telephone whether to let the commands build
complete, or to interrupt it. Often, if there was a problem,
the kernel delivery was still useful. For example, it was only
necessary to interrupt the commands build two or three times
out of twenty or more kernel deliveries.

In summary, the weekly delivery process offered the following
features:

• Files were delivered in advance, before the tests had
certified them

• Rapid team communication was used to notify the partners
depending on the delivery if any problem was detected

• Systems delivered were often usable by the partners even
when problems were detected

• Problems, status, and any changes were communicated
quickly and directly.

Conclusion 

The HP-UX kernel laboratory produced a version of the HP-UX
operating system that achieved excellent performance and rapid
time to market for a new workstation computer, the HP 9000
Model 720. This achievement was made possible by a simplified
management structure, the specification of minimum core
functionality, a quality control plan that used design and code
reviews, and a kernel integration process that featured full
automation of the software build, test, and delivery
activities.
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New Optimizations for PA-RISC 
Compilers 

Extensions to the PA-RISC architecture exposed opportunities for code 
optimizations that enable compilers to produce code that significantly 
boosts the performance of applications running on PA-RISC machines. 

by Robert C. Hansen 



Hewlett-Packard's involvement in reduced instruction set
computers (RISC) began in the early 1980s when a group was
formed to develop a computer architecture powerful and
versatile enough to excel in all of Hewlett-Packard's markets
including commercial, engineering, scientific, and
manufacturing. The designers on this team possessed unusually
diverse experience and training. They included compiler
designers, operating system designers, microcoders, performance
analysts, hardware designers, and system architects. The intent
was to bring together different perspectives, so that the team
could deal effectively with design trade-offs that cross the
traditional boundaries between disciplines. After 18 months of
iterative, measurement-oriented evaluation of what computers do
during application execution, the group produced an
architecture definition known today as Precision Architecture
RISC, or PA-RISC.

In the late 1980s, there were a number of groups that were
looking for ways to make Hewlett-Packard more successful in the
highly competitive workstation market. These groups realized
the need for better floating-point performance and virtual
memory management in PA-RISC to produce a low-cost,
high-performance, PA-RISC-based workstation product. Experts
from these groups and other areas were brought together to
collaborate on their ideas and to propose a set of expansions
to PA-RISC. Many members of this team were from the then newly
acquired Apollo Computers (now HP Apollo). With the knowledge
gained from years of experience with PA-RISC and PRISM from
Apollo Computers, suggested extensions to the architecture were
heavily scrutinized and only accepted after their benefits
could be validated. The result was a small but significant set
of extensions to PA-RISC now known as PA-RISC 1.1.

Although not a rigid rule, most of the architecture extensions
of PA-RISC 1.1 were directed at improving Hewlett-Packard's
position in the technical workstation market. Many of the
extensions aimed at improving application performance required
strong support in the optimizer portion of the PA-RISC
compilers. Key technical engineers were reassigned to increase
the staff of what had previously been a small optimizer team in
HP's California Language Laboratory. In addition, engineers
responsible for compiler front ends became involved with
supporting new optimization and compatibility options for the
two versions of the architecture. Finally, many compiler
members from the HP Apollo group shared their insights on how
to improve the overall code generation of the PA-RISC 1.1
compilers. The PA-RISC 1.1 extensions, together with
enhancements to the optimizing compilers, have enabled
Hewlett-Packard to build a low-cost, high-performance desktop
workstation with industry-leading performance.

The first release of the PA-RISC 1.1 architecture is found in
the HP 9000 Series 700 workstations running version 8.05 of the
HP-UX* operating system (HP-UX 8.05). The operating system and
the compilers for the Series 700 workstation are based on the
HP-UX 8.0 operating system, which runs on the HP 9000 Series
800 machines.

This article presents a brief discussion about the architecture
extensions, followed by an overview of the enhancements made to
the compilers to exploit these extensions. In addition to
enhancements made to the compilers to support architecture
extensions, there were a number of enhancements to traditional
optimizations performed by the compilers that improve
application performance, independent of the underlying
architecture. These generic enhancements will also be covered.
Finally, performance data and an analysis will be presented.

PA-RISC 1.1 Architecture Overview

Most of the extensions to PA-RISC were motivated by technical
workstation requirements and were designed to improve
performance in the areas of virtual memory management,
numerical applications, and graphics, all at the lowest
possible cost. Most of the architecture extensions can be
exploited by the compilers available on PA-RISC 1.1
implementations. Additional implementation-specific extensions,
like special instructions, have been made to improve
performance in critical regions of system code and will not be
discussed here.

New Instructions 

Most implementations of PA-RISC employ a floating-point assist
coprocessor to support high-performance numeric processing. It
is common for a floating-point coprocessor to contain at least
two functional units: one that performs addition and
subtraction operations and one that performs multiplication and
other operations. These two functional units can accept and
process data in parallel. To dispatch operations to these
functional units at a higher rate, two five-operand
floating-point instructions were added to the instruction set:

FMPYADD: Floating-point multiply and add

FMPYSUB: Floating-point multiply and subtract.

Fig. 1. Legal and illegal uses of the five-operand FMPYADD
instruction. Because of parallelism the multiply and add
operations execute at the same time. (a) Legal use of the
instruction. There is no interdependence between operands.
(b) Illegal use of the instruction. The operand in
floating-point register fr3 is used in both operations.

In a single instruction, the compiler can specify a
floating-point multiplication operation (two source registers
and one target) together with an independent floating-point
addition or subtraction operation in which one register is both
a source and a target. However, because the multiply operation
is executed in parallel with the add or subtract operation in a
five-operand instruction, the result of one operation cannot be
used in the paired operation. For example, in an FMPYADD, the
product of the multiplication cannot be used as a source for
the addition and vice versa (see Fig. 1).
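The independence rule can be expressed as a small check that a
compiler might apply before pairing a multiply with an add.
This is a sketch of the rule as stated in the paragraph above,
not the architecture's formal operand constraints; the function
name, the operand names, and the register strings are
hypothetical.

```python
def fmpyadd_is_legal(mul_src1, mul_src2, mul_dst, add_src, add_dst):
    """Five operands: mul_dst <- mul_src1 * mul_src2 and, in parallel,
    add_dst <- add_dst + add_src (add_dst is both source and target)."""
    # The product must not feed the addition or share its target.
    if mul_dst in (add_src, add_dst):
        return False
    # The sum must not feed the multiplication.
    if add_dst in (mul_src1, mul_src2):
        return False
    return True


legal = fmpyadd_is_legal("fr4", "fr5", "fr6", "fr7", "fr8")
illegal = fmpyadd_is_legal("fr4", "fr5", "fr3", "fr3", "fr8")
```

In the second call the multiply target fr3 is also the add
source, so the two operations are not independent and the pair
cannot be combined.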

Since most floating-point multipliers can also perform
fixed-point multiplication operations, the unsigned integer
multiplication instruction XMPYU was also defined in PA-RISC
1.1. XMPYU operates only on registers in the floating-point
register file described below. This dependency implies that
fixed-point operands may have to be moved from general
registers to floating-point registers and the product moved
back to a general-purpose register. Since there is no
architected support for moving quantities between
general-purpose and floating-point register banks directly,
this movement is done through stores and loads from memory. The
compiler decides when it is beneficial to use the XMPYU
instruction instead of the sophisticated multiplication and
division technique provided in PA-RISC. Signed integer
multiplication can also be accomplished using the XMPYU
instruction in conjunction with the appropriate extract (EXTRS,
EXTRU) instructions.
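One arithmetic fact behind the last sentence is that the low 32
bits of a product are the same whether the operands are read as
signed or unsigned two's-complement values. The sketch below
models only that fact in Python; `xmpyu`, `to_signed32`, and
`signed_mul32` are illustrative names, and the actual
instruction sequence with EXTRS and EXTRU is not given in the
text and is not reproduced here.

```python
MASK32 = 0xFFFFFFFF


def xmpyu(a, b):
    """Model of an unsigned 32 x 32 -> 64-bit multiply."""
    return (a & MASK32) * (b & MASK32)


def to_signed32(value):
    """Reinterpret the low 32 bits as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def signed_mul32(a, b):
    """Wrapping signed 32-bit product taken from the unsigned product."""
    return to_signed32(xmpyu(a, b))
```

For example, `signed_mul32(-3, 7)` yields -21 even though the
multiply itself treated both operands as unsigned.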

Additional Floating-Point Registers 

To increase the performance for floating-point-intensive code,
the PA-RISC 1.1 floating-point register file has been extended.
The number of 64-bit (double-precision) registers has been
doubled from 16 to 32 (see Fig. 2).

In addition, both halves of each 64-bit register can now be
addressed as a 32-bit (single-precision) register, giving a
total of 64 single-precision registers compared to only 16 for
PA-RISC 1.0. Moreover, contiguous pairs of single-precision
values can be loaded or stored using a single double-word load
or store instruction. Using a double-word load instruction to
load two single-precision quantities can be useful when
manipulating single-precision arrays and FORTRAN complex data
items.
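The layout that makes this possible can be illustrated with
Python's `struct` module: a single-precision complex value
occupies two adjacent 32-bit floats, which together form one
8-byte double word. The function names are hypothetical, and
the big-endian format string is an assumption chosen to match
PA-RISC byte order.

```python
import struct


def store_complex(re, im):
    """Pack a single-precision complex value as two adjacent 32-bit floats."""
    return struct.pack(">ff", re, im)     # 8 bytes: one double word


def load_pair(doubleword):
    """One 8-byte access yields both single-precision halves."""
    return struct.unpack(">ff", doubleword)


packed = store_complex(1.5, -2.25)
re_part, im_part = load_pair(packed)
```

One memory access therefore delivers both the real and the
imaginary part, instead of two separate single-word loads.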

Cache Hints 

On PA-RISC systems, instructions and data are typically fetched
and stored to memory through a small, high-speed memory known
as a cache. A cache shortens virtual memory access times by
keeping copies of the most recently accessed items within its
fast memory. The cache is divided into blocks of data and each
block has an address tag that corresponds to a block of memory.
When the processor accesses an instruction or data, the item is
fetched from the appropriate cache block, saving significant
time in not having to fetch it from the larger memory system.
If the item is not in the cache, a cache miss occurs and the
processor may stall until the needed block of memory is brought
into the cache.

Fig. 2. The floating-point register file contains 28 64-bit
data registers and 32-bit registers for reporting exceptional
conditions. The status register holds information on the
current rounding mode and the IEEE exceptions: overflow,
underflow, divide by zero, invalid operation, and inexact. If
an exception is raised when traps are enabled, an interrupt to
the CPU occurs, with the exception and the instruction causing
it recorded in an exception register.

To increase cache throughput, an extension to cache management
has been exposed to the compiler. A bit has been encoded in the
store instructions that can be used when the compiler knows
that each element in a cache block will be overwritten. This
hint indicates to the hardware that it is not necessary to
fetch the contents of that cache block from memory in the event
of a cache miss. This cache hint could be used to provide a
significant savings when copying large data structures or
initializing pages.
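A toy count of memory traffic shows where the savings come
from. The model below is an assumption made for illustration:
it supposes that neither the source nor the destination is in
the cache, that every block is copied in full, and that a store
miss without the hint fetches the destination block first. It
does not describe any particular PA-RISC cache implementation.

```python
def copy_block_fetches(num_blocks, use_hint):
    """Count memory block fetches when copying num_blocks cache blocks
    into a destination that is not yet in the cache (toy model)."""
    fetches = 0
    for _ in range(num_blocks):
        fetches += 1              # the source block must always be read
        if not use_hint:
            fetches += 1          # destination block fetched on the store miss
    return fetches


without_hint = copy_block_fetches(100, use_hint=False)
with_hint = copy_block_fetches(100, use_hint=True)
```

Under these assumptions the hint halves the number of block
fetches for a large copy, because the destination blocks are
never read before being overwritten.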

Optimization Enhancements 

Optimizing compilers make an important contribution to
application performance on PA-RISC processors. A single shared
optimizer back end* is used in most PA-RISC compilers. When
global optimization is enabled, the following traditional
transformation phases take place:⁷

• Global data flow and alias analysis (knowing which data
items are accessed by code and which data items may overlap
is the foundation for many phases that follow)

• Constant propagation (folding and substitution of constant
computations)

• Loop invariant code motion (computations within a loop that
yield the same result for every iteration)

• Strength reduction (replacing multiplication operations
inside a loop with iterative addition operations)

• Redundant load elimination (elimination of loads when the
current value is already contained in a register)

• Register promotion (promotion of a data item held in memory
to being held in a register)

• Common subexpression elimination (removal of redundant
computations and the reuse of the one result)

• Peephole optimizations (use of a dictionary of equivalent
instruction patterns to simplify instruction sequences)

• Dead code elimination (removal of code that will not
execute)

• Branch optimizations (transformation of branch instruction
sequences into more efficient instruction sequences)

• Branch delay slot scheduling (reordering instructions to
perform computations in parallel with a branch)

• Graph coloring register allocation (use of a technique
called graph coloring to optimize the use of machine
registers)

• Instruction scheduling (reordering instructions within a
basic block to minimize pipeline interlocks)

• Live variable analysis (removing instructions that compute
values that are not needed).
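Two of the phases listed above, loop invariant code motion and
strength reduction, are easy to show at source level. The
before and after functions below are a hand-written
illustration in Python; the function and variable names are
hypothetical, and a real optimizer performs these
transformations on its intermediate code rather than on source
text.

```python
def scaled_offsets_naive(n, stride, base, scale):
    out = []
    for i in range(n):
        k = base * scale            # loop invariant: same value every iteration
        out.append(k + i * stride)  # multiplication inside the loop
    return out


def scaled_offsets_optimized(n, stride, base, scale):
    out = []
    k = base * scale                # hoisted by loop invariant code motion
    offset = 0                      # strength reduction: i * stride becomes
    for _ in range(n):              # a running sum updated by addition
        out.append(k + offset)
        offset += stride
    return out
```

Both functions return the same list; the second does one
multiplication in total instead of two per iteration.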

With PA-RISC 1.1, a number of these areas were enhanced to take
advantage of the extensions to the architecture. Specifically,
the last two transformations, register allocation and
instruction scheduling, saw many changes to support the
extended floating-point registers and new five-operand
instructions.

In addition to the enhancements made to support the
architecture extensions, the compiler optimization team spent a
considerable amount of time analyzing application code to
identify missed optimization opportunities. There was also a
very thorough evaluation of Hewlett-Packard's optimizing
compilers to see how they matched some key workstation
competitors' compilers. Many architecture-independent
enhancements were identified and added at the same time as the
PA-RISC 1.1 enhancements.

* The FORTRAN compiler also uses an optimizing preprocessor
front end that performs some language-dependent optimizations
before sending the code to the standard FORTRAN compiler and
shared optimizer back end (see article on page 24).

These compiler enhancements were integrated with the HP-UX 8.0
compilers available on the HP 9000 Series 800 machines. Because
the same base was used, the architecture-independent
optimization enhancements added to the compilers will also
benefit code compiled with the HP-UX 8.0 compilers.

Many of the enhancements to the optimizing compilers led to
significant improvements in the Systems Performance Evaluation
Cooperative (SPEC) benchmark suite. Release 1.2b of the SPEC
suite contains 10 benchmarks that primarily measure CPU
(integer and floating-point) performance in the engineering and
scientific fields. Performance data for the SPEC benchmarks is
presented later in this article.

Improved Register Allocation 

Near-optimal use of the available hardware registers is crucial
to application performance. Many optimization phases introduce
temporary variables or prolong the use of existing register
variables over larger portions of a procedure. The PA-RISC
optimizer uses an interference graph coloring technique to
allocate registers to a procedure's data items. When the
coloring register allocator runs out of free registers, it is
forced to save or "spill" a register to memory. Spilling a
register implies that all instructions that access the item
that was spilled must first reload the item into a temporary
register, and any new definitions of the item are immediately
stored back to memory. Spilling can have a costly impact on
run-time performance.
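The relationship between the number of registers and the amount
of spilling can be seen in a minimal greedy coloring sketch.
It is not the PA-RISC allocator: a production graph coloring
allocator also simplifies the graph and weighs spill costs. The
function name, the most-constrained-first ordering, and the
dictionary representation of the interference graph are
assumptions made for illustration.

```python
def color_registers(interference, num_registers):
    """Greedy coloring. interference maps each data item to the set of
    items live at the same time. Returns (assignment, spilled)."""
    assignment, spilled = {}, []
    # Color the most constrained items first (a simple heuristic).
    for item in sorted(interference, key=lambda v: (-len(interference[v]), v)):
        taken = {assignment[n] for n in interference[item] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[item] = free[0]
        else:
            spilled.append(item)    # no register left: the item lives in memory
    return assignment, spilled


graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
three_regs = color_registers(graph, 3)
two_regs = color_registers(graph, 2)
```

With three registers every item in this small graph gets a
register; with two, one item must be spilled, which is the
situation the additional floating-point registers make less
frequent.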

With PA-RISC 1.1, the register allocator was enhanced to
support the additional floating-point registers. These
additional floating-point registers have greatly decreased the
amount of floating-point spill code in floating-point-intensive
applications. The register allocator now has more than twice
the number of 64-bit (double-precision) floating-point
registers available for allocation purposes (see Fig. 2). Also,
the PA-RISC 1.1 architecture now allows either half of a 64-bit
register to be used as a 32-bit (single-precision) register,
resulting in more than four times the number of
single-precision registers that are available in PA-RISC 1.0.

Improved Instruction Scheduling 

The instruction scheduler is responsible for reordering the
machine-level instructions within straight-line code to
minimize stalls in the processor's pipeline and to take
advantage of the parallelism between the CPU and the
floating-point coprocessor. It is also responsible for
attempting to fill pipeline delay slots of branch instructions
with a useful instruction. Of course, the instruction
scheduler must maintain the correctness of the program
when it reorders instructions. The instruction scheduling
algorithm used in the PA-RISC compilers is based on the
technique described in reference 9.

Until recently, instruction scheduling was done just once
after register allocation, immediately before the machine
instructions are written to the object file. This one-pass
approach suffered because the register allocator may
allocate registers to data items in a manner that imposes
artificial dependencies. These artificial dependencies can
restrict the scheduler from moving instructions around to
avoid interlocks (i.e., pipeline stalls).

For example, the LDW (load word) instruction on PA-RISC
typically takes two cycles to complete. This means that if
the very next instruction following the LDW uses the
target register of the LDW, the processor will stall for one
cycle (load-use interlock) until the load completes. The
instruction scheduler is responsible for reordering instructions
to minimize these stalls. The software pipelining
article on page 30 describes pipeline stalls in more detail.

If the register allocator allocates the same register to two
independent data items, this might impair the reordering
operations of the instruction scheduler. For example, if
register allocation results in the following code:

LDW 0(0, %r30), %r20    ; load some data into register 20
ADD %r20, %r21, %r22    ; use register 20
LDW 8(0, %r30), %r20    ; load some other data into register 20
ADD %r20, ...           ; use register 20

the scheduler cannot move any of the instructions upwards
or downwards to prevent load-use interlocks
because of the dependencies on register 20. This could
lead to a situation in which no useful instruction can be
placed between the LDW and the instructions that use
register 20.
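
The cost of such an artificial dependency can be made concrete with a small stall counter. This is a rough sketch under stated assumptions: instructions are modeled as hypothetical (opcode, sources, target) tuples, only the load-use interlock is modeled, and registers r19, r23, and r24 are invented to complete the example because the source elides the operands of the second ADD.

```python
LOAD_LATENCY = 2  # cycles for LDW to complete, as stated in the text


def count_load_use_stalls(code):
    """Count stall cycles caused by using an LDW target in the very next instruction."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        opcode, _, target = prev
        if opcode == "LDW" and target in cur[1]:
            stalls += LOAD_LATENCY - 1
    return stalls


# Both loads share r20, so neither ADD can be moved away from its load.
shared_register = [
    ("LDW", ("r30",), "r20"),
    ("ADD", ("r20", "r21"), "r22"),
    ("LDW", ("r30",), "r20"),
    ("ADD", ("r20", "r23"), "r24"),
]

# With a distinct register (r19) for the second item, the loads can be
# hoisted so that each one completes before its value is used.
distinct_registers = [
    ("LDW", ("r30",), "r20"),
    ("LDW", ("r30",), "r19"),
    ("ADD", ("r20", "r21"), "r22"),
    ("ADD", ("r19", "r23"), "r24"),
]

stalls_shared = count_load_use_stalls(shared_register)
stalls_distinct = count_load_use_stalls(distinct_registers)
```

In this model the shared-register sequence stalls twice and the reordered sequence not at all, which is the freedom a scheduler gains when it runs before registers are assigned.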

These artificial dependencies imposed by the register
allocator could also limit the instruction scheduler's
ability to interleave general register instructions with
floating-point instructions. Interleaving is crucial in
keeping both the general CPU and the floating-point
coprocessor busy and exploiting a limited amount of
parallelism.

To improve the effectiveness of instruction scheduling,
both the PA-RISC 1.0 and 1.1 compilers now perform
instruction scheduling twice, once before register allocation
and once after. By scheduling before register allocation,
the scheduler can now detect a greater amount of
instruction-level parallelism within the code and thus have
greater freedom in reordering the instructions. Scheduling
after register allocation enables the scheduler to reorder
instructions in regions where the register allocation may
have deleted or added instructions (i.e., spill code
instructions).

The instruction scheduler's dependency analysis capabilities
have also been improved to recognize many of the
cases where indexed loads and stores are accessing
distinct elements of the same array. Through more
accurate information, the scheduler has greater freedom
to safely move loads of some array elements ahead of
stores to other elements of that array.



Another improvement made to help the scheduler when it
is ordering code involved tuning the heuristics used to
take into account some of the unique features of
implementations of PA-RISC 1.1. These heuristics are aimed at
avoiding cache stalls (stores immediately followed by
loads or other stores), and modeling the floating-point
latencies of the new PA-RISC 1.1 implementation more
closely.

Finally, the instruction scheduler has also been enhanced
to identify cases in which the new five-operand instructions
available in PA-RISC 1.1 can be formed. The scheduler,
running before register allocation, identifies
floating-point multiplication (FMPY) instructions and independent
floating-point addition (FADD) or subtraction (FSUB)
instructions that can be combined to form a single
five-operand FMPYADD or FMPYSUB instruction.

When five-operand instructions are formulated during the
scheduler pass, global data flow information is used to
ensure that one of the registers used as an operand of
the FADD or FSUB can be used to hold the result of the
FADD or FSUB. This will be true if the data flow information
shows that the register containing the original
operand has no further use in the instructions that follow.
For example, in the instruction sequence:

FMPY fr1, fr2, fr3    ; fr3 = fr1 * fr2
FADD fr4, fr5, fr6    ; fr6 = fr4 + fr5

if fr5 has no further uses in the instructions that follow
the FADD, it can be used to replace register fr6 as the
result of the addition. Any instructions that follow the
FADD that use the result of the addition would be modified
to use register fr5 instead of register fr6.

Another problem confronting the instruction scheduler is
instructions that occur between two instructions targeted
to be joined. For example, take a simple case of using fr3
between the FMPY and the FADD:

FMPY  fr1, fr2, fr3    ; fr3 = fr1 * fr2
FSTDS fr3, memory      ; store fr3 in memory
FADD  fr4, fr5, fr6    ; fr6 = fr4 + fr5

The five-operand FMPYADD cannot be placed in the position
of the FADD without moving the use of fr3 below the
new five-operand instruction because the wrong value
may be stored in memory. When the scheduler is satisfied
that the necessary criteria have been met, it will produce
the five-operand instruction:

FMPYADD fr1, fr2, fr3, fr4, fr5   ; fr3 = fr1 * fr2, fr5 = fr4 + fr5
FSTDS   fr3, memory

where register fr5 serves as both an operand and the
result of the addition.

These five-operand instructions allow the compiler to
reduce significantly the number of instructions generated
for some applications. In addition, they allow the
floating-point coprocessor to dispatch two operations in a single
cycle.
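
The conditions for forming a five-operand instruction can be summarized as a small check. The sketch below is an assumption-laden illustration, not the compiler's implementation: instructions are hypothetical (opcode, src1, src2, dst) tuples, only FADD is handled, and the set of registers still needed after the FADD stands in for the global data flow information mentioned above.

```python
def try_fuse(fmpy, fadd, used_later):
    """Try to combine an FMPY and an independent FADD into one FMPYADD.

    used_later: registers whose current values are still read after the FADD.
    Returns (fused_instruction, rename_map), or None if fusing is unsafe.
    """
    _, m1, m2, mdst = fmpy
    _, a1, a2, adst = fadd
    # The two operations must not depend on each other.
    if mdst in (a1, a2) or adst in (m1, m2, mdst):
        return None
    # FADD is commutative, so either addend may double as the result,
    # provided its old value is dead after the addition.
    if a2 in used_later and a1 not in used_later:
        a1, a2 = a2, a1
    if a2 in used_later:
        return None
    fused = ("FMPYADD", m1, m2, mdst, a1, a2)  # a2 = a1 + a2
    return fused, {adst: a2}  # later readers of adst must read a2 instead


fmpy = ("FMPY", "fr1", "fr2", "fr3")
fadd = ("FADD", "fr4", "fr5", "fr6")

fused_ok = try_fuse(fmpy, fadd, used_later={"fr3", "fr6"})
fused_live = try_fuse(fmpy, fadd, used_later={"fr4", "fr5"})
fused_dependent = try_fuse(fmpy, ("FADD", "fr3", "fr5", "fr6"), used_later=set())
```

The first call reproduces the example in the text: fr5 becomes both operand and result, and uses of fr6 are renamed to fr5. The other two calls show the cases that must be rejected.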

Software Pipelining 

Software pipelining is an advanced transformation that
attempts to interleave instructions from multiple iterations
of a program loop to produce a greater amount of
instruction-level parallelism and minimize pipeline stalls.
Software pipelining is a new component that has been
added to the global optimizer for both PA-RISC 1.0 and
PA-RISC 1.1. The article on page 38 provides more
information on this loop transformation scheme.
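
As a rough source-level picture of the transformation, the sketch below rewrites a simple scaling loop so that each pass of the loop body works on three different iterations at once. It is only an analogy written for illustration: the function names are invented, and a real software pipeliner works on machine instructions and their latencies rather than on source statements.

```python
def scale_naive(x, a):
    """Each iteration loads, multiplies, and stores before the next one starts."""
    y = [0.0] * len(x)
    for i in range(len(x)):
        loaded = x[i]          # load
        product = loaded * a   # multiply (must wait for the load)
        y[i] = product         # store (must wait for the multiply)
    return y


def scale_pipelined(x, a):
    """The same loop with iterations overlapped: prologue, kernel, epilogue."""
    n = len(x)
    if n < 2:
        return scale_naive(x, a)
    y = [0.0] * n
    # Prologue: start iterations 0 and 1.
    product = x[0] * a
    loaded = x[1]
    # Kernel: store iteration i, multiply iteration i+1, load iteration i+2.
    # The three statements are independent of one another within one pass.
    for i in range(n - 2):
        y[i] = product
        product = loaded * a
        loaded = x[i + 2]
    # Epilogue: finish the two iterations still in flight.
    y[n - 2] = product
    y[n - 1] = loaded * a
    return y


sample = [1.5, 2.0, -3.25, 4.0, 0.5]
pipelined_result = scale_pipelined(sample, 2.0)
```

Because no statement in the kernel depends on a result produced in the same pass, a scheduler has room to hide load and multiply latencies.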

Register Reassociation 

Register reassociation is a code-improving transformation
that supplements loop-invariant code motion and strength
reduction. The main objective of this optimization is to
eliminate integer arithmetic operations found in loops. It
is particularly useful when applied to multidimensional
array address computations. The article on page 33 provides
more information on this transformation technique.
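
The effect on an array address computation can be illustrated at the source level. The sketch below is an assumed example, not output of the optimizer: it sums one column of a two-dimensional array stored row by row in a flat list, first recomputing the index on every iteration and then carrying a single running index.

```python
def column_sum_indexed(flat, nrows, ncols, j):
    """Recompute i * ncols + j on every iteration (a multiply and an add)."""
    total = 0
    for i in range(nrows):
        total += flat[i * ncols + j]
    return total


def column_sum_reassociated(flat, nrows, ncols, j):
    """Fold the loop-invariant j into the start value and step by ncols."""
    total = 0
    index = j
    for _ in range(nrows):
        total += flat[index]
        index += ncols  # one addition replaces the multiply and add
    return total


matrix = list(range(12))  # a 3 x 4 array stored row by row
indexed = column_sum_indexed(matrix, 3, 4, 1)
reassociated = column_sum_reassociated(matrix, 3, 4, 1)
```

Both functions return the same sum; the second needs only one integer addition per iteration for addressing.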

Linker Optimizations 

The ADDIL instruction is used by the compilers in conjunction
with load or store instructions to generate the virtual
addresses of global or static data items. The compiler
must produce these large addresses because the compiler
has no knowledge of where the desired data will be
mapped with regard to the global data base register. The
ADDIL instruction is unnecessary if the short displacement
field in the load or store instruction is adequate to
specify the offset of the data from the base register. The
actual displacement of data items is finalized at link time.
The PA-RISC 1.0 and 1.1 compilers now arrange for small
global variables to be allocated as close to the base of
the data area as possible. The HP-UX linker has been
enhanced to remove any unnecessary ADDIL instructions
when the short displacement field in load and store
instructions is found to be adequate.

As optimization strategies become more sophisticated, the
use of run-time profiling data can be very useful in
guiding various transformations. In the first of many
stages to come, the PA-RISC optimizing linker now uses
profiling information to reorder the procedures of an
application to reduce cache contention and to minimize
the number of dynamic long branches needed to transfer
control between heavily called routines. This repositioning
technique is currently known as profile-based procedure
repositioning.

These two linker-based optimizations are enabled through
special compiler and linker options. See "Link-Time
Optimizations" on page 22 for more information on these
two transformations.

FORTRAN Vectorizing Preprocessor 

The Hewlett-Packard FORTRAN optimizing preprocessor
is a major addition to the FORTRAN compiler for the HP
9000 Series 700 implementation of PA-RISC 1.1. The
preprocessor was a joint effort of Hewlett-Packard and an
outside vendor. Using advanced program and data flow
analysis techniques, and with specific details covering the
implementation of the underlying architecture, FORTRAN
source code is transformed to be more efficient and to
take advantage of a highly tuned vector library. The
preprocessor has boosted benchmark performance and 
real customer application performance by as much as 

The Series 700 FORTRAN optimizing preprocessor is
described in the article on page 24.

Compatibility 

An important design goal in evolving the architecture
to PA-RISC 1.1 was to allow a smooth transition from
existing PA-RISC 1.0 implementations. With the exception
of FORTRAN, the compilers on the Series 700
implementation of PA-RISC 1.1 are based on the compilers
used in the existing Series 800 implementations of
PA-RISC 1.0. Because the same compilers are used on Series
800 and 700 systems, maximum portability of source code
is achieved.

Another system design goal was to provide software
compatibility at the source code level with the HP 9000
Series 300 and Series 400 workstations, which are based
on the Motorola MC680x0 architecture.

Special efforts have been made for the C and FORTRAN
languages to provide this compatibility. The PA-RISC C
compiler has been enhanced with compiler directives to
provide Series 300 and 400 compatible data alignment,
which is the one area of potential incompatibility with
PA-RISC. In the case of FORTRAN, a larger number of
compatibility issues exist. The first release of system
software included a version of the FORTRAN compiler
from the Series 800 and a separate version from the
Series 300 and 400 workstations. The latest releases now
contain a single FORTRAN compiler based on the Series
300 and 400 workstation compiler that has an option that
allows users to compile their FORTRAN applications with
semantics identical to either the Series 800 compiler or
the Series 300 compiler.

Given that the PA-RISC 1.1 architecture is a strict superset
of PA-RISC 1.0, all HP-UX object code is completely
forward compatible from PA-RISC 1.0 based implementations
to the new PA-RISC 1.1 workstations. Portability
includes object modules, libraries, and relocatable programs.
Programs compiled and linked on PA-RISC 1.0
implementations can run unchanged on PA-RISC 1.1
implementations, and any combination of object modules
and libraries from the two systems can be linked together.
Recompilation is necessary only if the programmer
wishes to take advantage of the architecture and
optimization enhancements. This forward compatibility of
object modules allowed many vendors to port their
products to the Series 700 with little or no effort.

Although object files can be ported from PA-RISC 1.0
implementations to PA-RISC 1.1 implementations, the
reverse may not always be true if the object file on a
PA-RISC 1.1 machine was generated by a compiler that
exploits the extensions to the PA-RISC 1.1 architecture.
The HP-UX loader detects such situations and refuses to
execute a program on a PA-RISC 1.0 implementation that
has been compiled with PA-RISC 1.1 extensions. To assist
the user in generating the most portable object files, a
compiler option has been added to specify the destination
architecture (DA) for the code generated by the compiler.
For example,

% cc +DA1.0 my_prog.c 

would generate an object file based on the PA-RISC 1.0
architecture definition. The object file could also be
ported directly to a PA-RISC 1.1 implementation without
the need for recompilation. A user can also use this
option explicitly to ask for PA-RISC 1.1 extensions, or for
cross compiling while on a PA-RISC 1.0 implementation
with the command-line sequence:

% cc +DA1.1 my_prog.c

Of course, the object file produced could no longer be 
executed on a PA-RISC 1.0 implementation. 

If the destination architecture is not specified, the default 
for the compilers is to generate code based on the 
architecture implementation on which the compiler is 
executing. 
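
The compatibility rules described above amount to a simple ordering of architecture levels. The following sketch restates them under that assumption; the function and table names are hypothetical and do not correspond to any HP-UX loader or compiler interface.

```python
# Architecture levels in increasing order; PA-RISC 1.1 is a strict superset of 1.0.
ARCH_LEVEL = {"1.0": 0, "1.1": 1}


def can_execute(object_arch, machine_arch):
    """True if code built for object_arch may run on machine_arch."""
    return ARCH_LEVEL[object_arch] <= ARCH_LEVEL[machine_arch]


def destination_arch(host_arch, requested=None):
    """Pick the target architecture: an explicit request wins (as +DA does);
    otherwise the compiler targets the machine it is running on."""
    return requested if requested is not None else host_arch


forward = can_execute("1.0", "1.1")       # 1.0 objects on a 1.1 machine
backward = can_execute("1.1", "1.0")      # 1.1 objects on a 1.0 machine
default_on_700 = destination_arch("1.1")  # no option given on a 1.1 host
portable_on_700 = destination_arch("1.1", requested="1.0")
```

The last line corresponds to compiling with +DA1.0 on a PA-RISC 1.1 machine to obtain an object file that runs on both kinds of system.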

Performance 

Through a combination of clock rate, instruction set
extensions, compiler optimization enhancements, and
processor implementation, the HP 9000 Series 700
workstations are currently producing industry-leading
performance. Although much of this performance
improvement comes from an increase in clock rate, as seen
in the tables below, the compilers play a significant role
in increasing the overall performance.

Table I compares the raw speed of the HP 9000 Series
720 workstation based on PA-RISC 1.1 architecture with
the HP 9000 Series 835 workstation based on PA-RISC
1.0. The SPEC benchmarks for the Series 835 were
compiled with the HP-UX 7.0 compilers using full
optimization. For the Series 720, the HP-UX 8.07 compilers
containing the latest enhancements to support PA-RISC
1.1 were used. For the SPEC benchmark suite, higher
SPECmarks imply higher throughput.

Table I
Performance Comparison of PA-RISC Implementations

Processor/        Clock    Cache Size    SPECmarks
Implementation    (MHz)    (Kbytes)      Integer   Float   Overall

Model 835/        15       128/128*      9.7       9.1     --
PA-RISC 1.0
Model 720/        50       128/256*      39.5      80.0    60.4
PA-RISC 1.1

*Instruction Cache/Data Cache

The data in Table II compares the relative efficiency of
the HP 9000 Series 835 and the HP 9000 Series 720 by
normalizing the benchmark performance. To normalize the
numbers, the performance numbers are divided by the
clock frequency. The normalized SPECmark performance
of the Model 720 is 92% higher than the normalized
performance of the Model 835. Floating-point performance,
which is 162% higher, is primarily because of the
optimizing preprocessor, better compiler optimizations,
architecture extensions, implementation of separate
floating-point multiplication and arithmetic functional
units, faster floating-point operations, and larger caches.
The gains in the integer SPEC benchmark (22%) are
primarily because of enhancements to traditional
optimizations that are architecture independent.
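
The normalization is simple arithmetic and can be checked directly. The sketch below assumes the integer and floating-point SPECmarks of 9.7 and 9.1 for the Model 835 and 39.5 and 80.0 for the Model 720 (the 80.0 figure is the SPECfp value that also appears in Table V), with clock rates of 15 and 50 MHz; the dictionary layout and function name are illustrative only.

```python
# SPECmarks and clock rates as read from the tables in this article.
machines = {
    "Model 835": {"mhz": 15, "integer": 9.7, "float": 9.1},
    "Model 720": {"mhz": 50, "integer": 39.5, "float": 80.0},
}


def normalized(name, kind):
    """SPECmark per MHz, rounded to two places as in Table II."""
    m = machines[name]
    return round(m[kind] / m["mhz"], 2)


norm_835_int = normalized("Model 835", "integer")
norm_720_int = normalized("Model 720", "integer")
norm_835_fp = normalized("Model 835", "float")
norm_720_fp = normalized("Model 720", "float")

# Improvement of the Model 720 over the Model 835, from the rounded figures.
int_gain = round(norm_720_int / norm_835_int, 2)
fp_gain = round(norm_720_fp / norm_835_fp, 2)
```

Dividing out the clock rate separates the contribution of the architecture, implementation, and compilers from the contribution of the faster clock.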

Table II
Normalized Performance Comparison of PA-RISC Implementations

Processor/        Clock    Normalized Performance      Improvement over Series 835
Implementation    (MHz)    Integer  Float   Overall    Integer  Float   Overall

Model 835/        15       0.65     0.61    0.63       1.00     1.00    1.00
PA-RISC 1.0
Model 720/        50       0.79     1.60    1.21       1.22     2.62    1.92
PA-RISC 1.1



To see exactly how much performance was gained
through enhancements to the traditional compiler
optimizations (not architecture-specific), we compiled the
SPEC benchmarks using the HP-UX 8.07 compilers with
level 2 optimization and the destination architecture
PA-RISC 1.0. This disables the use of the added
instructions and floating-point registers. We also disabled use of
the FORTRAN optimizing preprocessor. Table III shows
how the HP-UX 7.0 SPEC benchmarks compare to the
HP-UX 8.05 benchmarks while running on an HP 9000
Model 720.

From Table III, we can see that the enhancements made
to the traditional compiler optimizations performed by the
compilers produced gains of 1 to 24 percent.

It is also interesting to see how much the architecture
itself contributed to performance improvement. To do
this, we used the same HP-UX 8.05 compilers (with the
-O option, which indicates compiling without the FORTRAN
optimizing preprocessor) to produce SPEC benchmarks
compiled for PA-RISC 1.0 and PA-RISC 1.1. Table
IV shows that all floating-point benchmarks except Spice
show a significant improvement. This improvement comes
directly from the larger register file and the added
instructions in the PA-RISC 1.1 instruction set. The
integer SPEC benchmarks are absent from this table
because the architecture enhancements do little for
integer code.

Table III
Comparison between Benchmarks Compiled with HP-UX 7.0 and
HP-UX 8.07 Compilers Running on an HP 9000 Model 720 Workstation

Benchmarks       HP-UX 7.0   HP-UX 8.05   Compiler
                                          Improvement

001.gcc1.35      37.4        --           1.12
008.espresso     --          --           1.17
022.li           38.1        38.6         1.01
023.eqntott      36.8        41.2         1.12
013.spice2g6     37.1        --           1.24
015.doduc        37.6        44.2         1.18
020.nasa7        36.8        42.3         1.15
030.matrix300    --          27.0         1.05
042.fpppp        52.5        59.6         1.14
047.tomcatv      38.8        40.8         1.05
SPECint          35.8        39.5         1.10
SPECfp           37.3        42.3         1.13
SPECmark         36.7        41.1         1.12


Table IV
Performance Improvement Resulting from Architecture
Enhancements on the HP 9000 Model 720 Workstation

Benchmark        PA-RISC 1.0   PA-RISC 1.1   Architecture
                                             Improvement

013.spice2g6     --            --            1.00
015.doduc        44.2          50.6          1.14
020.nasa7        42.3          44.2          1.04
030.matrix300    27.0          28.2          1.04
042.fpppp        59.6          82.8          1.39
047.tomcatv      40.8          50.4          1.24
SPECfp           42.3          47.8          1.13
SPECmark         41.1          44.4          1.08

Finally, we wanted to see how much the optimizing
preprocessor contributed to the SPEC benchmark
improvement. To do this, we used the HP-UX 8.05 FORTRAN
compiler to produce two sets of the FORTRAN
SPEC benchmarks. Both sets were compiled with full
optimization, but only one was compiled with the addition
of the preprocessor. While benchmarks nasa7 and tomcatv
showed fairly large improvements with the optimizing
preprocessor, the gains for matrix300 were dramatic. All
these benchmarks are known to suffer
from cache and TLB (translation lookaside buffer) miss
penalties, but the preprocessor was able to improve their
performance through its memory hierarchy optimizations.
Table V shows a comparison between the benchmarks
created on an HP-UX 8.05 operating system running on an
HP 9000 Model 720 and compiled with and without the
optimizing preprocessor enhancements. Excluded benchmarks
showed little or no gain. See the article on page 24
for more about the FORTRAN optimizing preprocessor.

Table V
Performance Gains with the FORTRAN Optimizing Preprocessor

Benchmarks       Without        With           Improvement
                 Preprocessor   Preprocessor

020.nasa7        44.2           --             1.42
030.matrix300    28.2           320.9          11.38
047.tomcatv      50.4           67.1           1.33
SPECfp           47.8           80.0           1.67
SPECmark         44.4           60.4           --



Conclusions 

To remain competitive in the workstation market, the
PA-RISC architecture has been extended to better meet
the performance demands of workstation applications.
With these changes to the architecture, Hewlett-Packard's
compiler products have evolved to exploit the extensions
made. Most important, the compilers successfully exploit
the increase in the number of floating-point registers
and the new instructions, including the integer multiply
and the five-operand instructions.

Besides being enhanced to exploit these new architectural
features, additional code-improving transformations have
been introduced that are independent of the underlying
architecture and substantially boost the performance of
applications. These include a new vectorizing preprocessor
for FORTRAN, software pipelining, register reassociation,
link-time optimizations, and better instruction
scheduling. The combined result of the architecture
extensions, compiler enhancements, and a high-speed CMOS
processor implementation is a workstation system that
compares favorably with the most advanced workstations
presently available.

Link-Time Optimizations



There are some optimizations that can be performed only when the linker produces
an executable file. For the PA-RISC systems these optimizations include removing
unnecessary instructions by changing the location of certain data segments, and
locating procedures that call each other frequently close together.

Elimination of Unnecessary ADDIL Instructions 

Compilers generally do not know whether their data will be close to the base
register for the data segment. Therefore, references to global or static variables on
PA-RISC machines require two instructions to form the address of a variable or to
load (or store) the contents of the variable. For example, the instructions

ADDIL LR'var-$global$, dp
LDW   RR'var-$global$(r1), r10

load the contents of a global variable into register 10.

The ADDIL instruction constructs the left side of the 32-bit virtual address. In most
cases, however, the data is within reach of the load or store instructions, and an
unnecessary ADDIL instruction is present in the code. Since ADDILs account for
about ?% of all generated code, significant runtime savings result from their
removal.

If the location for the variable turns out to be close to the global data pointer dp,
then the offset of the ADDIL is zero and the ADDIL is like a COPY of global base
register 27 (the location of dp) to register 1. In such a case, it is more efficient to
eliminate the ADDIL and use register 27 as the base register in the LDW
instruction. This elimination can be performed at link time once the linker lays out all the
global data and computes the value that will be assigned to dp.

The -O linker option turns on linker optimizations. Link-time optimizations include
removing the unnecessary ADDILs. Data is also rearranged to increase the number
of data items that can be accessed without ADDILs. The -O option is passed to the
linker by the compilers when the +O3 compiler option is selected. The +O3 option
also advises the compiler not to optimize by moving ADDILs out of loops, in the
expectation that they will be removed at link time. This can be very effective in
reducing register pressure for some procedures. For example, to optimize a C
program at link time as well as compile time, use cc +O3 foo.c

Because shared libraries on HP-UX use position independent code that is
referenced from register 19 as a base register, ADDIL elimination is not done when
building an HP-UX shared library. It is also in conflict with the -A (dynamic linking)
option, the -r (relocatable link) option, and the -g (symbolic debugging) option. All
conflicts are resolved by disabling this optimization. Shared libraries and position
independent code are described on page 46.

The linker rearranges data to maximize the number of variables that can be placed
at the beginning of the data area, increasing the probability that ADDILs
referencing these variables can be removed. Programs that do not conform to the
standard and rely on specific positioning of global or static variables may not work
correctly after this optimization.

ADDIL elimination is appropriate for programs that access global or static variables
frequently. Programs not doing so may not show noticeable improvement.
Link-time optimization increases linking time significantly (approximately 20%) because
of the additional processing and memory required.
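
The benefit of placing small variables first can be sketched as follows. This is a simplified model written for illustration, not the HP-UX linker's algorithm: it assumes a 14-bit signed short displacement in load and store instructions, ignores alignment, and uses invented variable names and sizes.

```python
# Assumed reach of the short displacement field (14-bit signed), in bytes.
SHORT_DISP_MIN, SHORT_DISP_MAX = -8192, 8191


def layout(sizes, order):
    """Assign each variable an offset from the data pointer in the given order."""
    offsets, cursor = {}, 0
    for name in order:
        offsets[name] = cursor
        cursor += sizes[name]
    return offsets


def addils_needed(offsets):
    """Variables whose offset cannot be reached without an ADDIL."""
    return [n for n, off in offsets.items()
            if not SHORT_DISP_MIN <= off <= SHORT_DISP_MAX]


# Hypothetical global data of a program, sizes in bytes.
sizes = {"table": 16384, "count": 4, "flag": 4, "buffer": 4096}

declared_order = layout(sizes, list(sizes))
small_first = layout(sizes, sorted(sizes, key=lambda n: (sizes[n], n)))

needed_before = addils_needed(declared_order)
needed_after = addils_needed(small_first)
```

In declaration order the large table pushes three variables out of reach; with small items first, every variable in this example can be addressed directly from the base register.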



Profile-Based Procedure Repositioning at Link Time

Research has consistently shown that programs tend to keep using the instructions
and data that were used recently. One of the corollaries of this principle is that
programs have large amounts of code (and to a lesser extent data) that is used to
handle things that very seldom happen, and therefore are only in the way when
running normal cases.

This observation is exploited by a new optimization in the HP-UX 8.05 linkers
called profile-based procedure repositioning (sometimes referred to as
feedback-directed positioning). This three-step optimization first instruments the program to
count how often procedures call each other at run time. The instrumented program
is run on sample input data to collect a profile of the calls executed by the
program. The linker then uses that profile information in the final link of the
production program to place procedures that call each other frequently close together.

A more important case is the inverse: things that are infrequently or never called
are grouped together far away from the heavily used code. This increases
instruction-cache locality and in large applications decreases paging, since only the code
that will be used is demand-loaded into main memory or cache, not a mixture of
useful and unneeded code that happens to be allocated to the same page or cache
line.

This optimization is invoked by two new linker options:

• -I: Instrument the code to collect procedure call counts during execution. This
option is used in conjunction with the -P option.

• -P: Examine the data file produced with the -I option and reposition the
procedures according to a "closest is best" strategy.

These options are often passed to the linker via the compiler driver program's -W
option. For instance, a C program can be optimized with profile-driven procedure
positioning by:

cc -c -O foo.c       # compile with optimizations
cc -Wl,-I foo.o      # link with profiling instrumentation code added
a.out < data.in      # run program to generate profile information
                     # in the "flowdata" file
cc -Wl,-P foo.o      # link with procedures positioned according to
                     # profile

The first link of the executable produces an executable with extra code added to
produce a file of profile information with counts of all the calls between each pair
of procedures executed. The final link uses the profile data file information to
determine the order of procedures in the final executable file, overriding the
normal positioning by the order of the input files seen. This order will optimize use of
the virtual memory system for the program's code segment. A secondary effect is
to reduce the number of long branch stubs (code inserted to complete calls longer
than 256K bytes). While the total number of long branch stubs may actually
increase, the number of long branches executed at run time will decrease.
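
One way to picture a "closest is best" ordering is a greedy pass over the call counts that keeps joining the chains of the most frequently calling pairs. The sketch below is an assumed illustration only: the sidebar does not describe the linker's actual algorithm, and the function name, procedure names, and call counts are invented.

```python
def order_procedures(procedures, call_counts):
    """Order procedures so that frequent caller/callee pairs end up adjacent.

    call_counts maps (caller, callee) to the number of calls in the profile.
    """
    chains = {p: [p] for p in procedures}  # every procedure starts alone
    heat = {p: 0 for p in procedures}      # total calls involving p
    for (a, b), n in call_counts.items():
        heat[a] += n
        heat[b] += n
    # Join chains, hottest call pair first.
    for (a, b), _ in sorted(call_counts.items(), key=lambda kv: -kv[1]):
        first, second = chains[a], chains[b]
        if first is not second:
            first.extend(second)
            for p in second:
                chains[p] = first
    # Emit the hottest chains first; never-called procedures fall to the end.
    ordered, placed = [], set()
    for p in sorted(procedures, key=lambda p: -heat[p]):
        if p not in placed:
            ordered.extend(chains[p])
            placed.update(chains[p])
    return ordered


procs = ["main", "parse", "error", "emit", "usage", "unused"]
profile = {
    ("main", "parse"): 1000,
    ("parse", "emit"): 900,
    ("parse", "error"): 2,
    ("main", "usage"): 1,
}
final_order = order_procedures(procs, profile)
```

In this example the three heavily used procedures are laid out together at the front, and the rarely or never executed ones trail behind them, away from the hot code.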

Carl Burch 

Software Engineer

California Language Laboratory

HP 9000 Series 700 FORTRAN
Optimizing Preprocessor

By combining HP design engineering and quality assurance capabilities
with a well-established third-party product, the performance of Series 700
FORTRAN programs, as measured by key workstation benchmarks, was
improved by more than 30%.

by Robert A. Gottlieb, Daniel J. Magenheimer, Sue A. Meloy, and Alan C. Meyer 



An optimizing preprocessor is responsible for modifying 
source code in a way that allows an optimizing compiler 
to produce object code that makes the best use of the 
architecture of the target machine. The executable code 
resulting from this optimization process is able to make 
efficient use of storage and execute in minimum time.

The HP 9000 Series 700 FORTRAN optimizing preprocessor
uses advanced program and data flow analysis techniques
and a keen understanding of the underlying
machine implementation to transform FORTRAN source
code into code that is more efficient and makes calls to a
highly tuned vector library. The vector library is a group
of routines written in assembly language that are tuned to
run very fast (see "Vector Library" on page 29). Fig. 1
shows the data flow involved in using the optimizing
preprocessor to transform FORTRAN source code into an
optimized executable file.

Fig. 1. Data flow for compiling FORTRAN source code using the
optimizing preprocessor. FORTRAN source code passes through the
FORTRAN optimizing preprocessor, which produces optimized
FORTRAN code; the FORTRAN optimizing compiler then produces
optimized object code.

A slightly different version of this product serves as the
preprocessor for HP Concurrent FORTRAN, which is now
running on HP Apollo DN 10000 computers. HP Apollo
engineers responsible for this product identified
opportunities for substantial improvements to the preprocessor
and concluded that these improvements were also
applicable to the Series 700 FORTRAN. Performance analysis
confirmed these conclusions, and after marketing analysis,
an extended multisite, cross-functional team was formed
to incorporate the preprocessor into the FORTRAN
compiler for the HP 9000 Series 700 computer systems.
Because of this effort, as of the HP-UX 8.05 release, the
preprocessor is bundled with every FORTRAN compiler.

The preprocessor is based on a third-party product. HP's
contributions included:

• Tying the preprocessor into the HP FORTRAN product.
(This included user interface changes and extensive
documentation changes.)

• Identifying modifications required to allow the preprocessor
to recognize HP's extended FORTRAN dialect

• Assembly coding a vector library that incorporates
knowledge of CPU pipelining details and implementation
dependent instructions to allow the Series 700 to work at
peak performance

• Performing extensive quality assurance processes that
uncovered numerous defects, ensuring that the product
meets HP's high-quality standards.

These contributions are discussed in detail in this article.
Examples of specific transformations and performance
improvements on key industry benchmarks are also 
described. 

Preprocessor Overview 

Although the preprocessor is bundled with every Series
700 FORTRAN compiler as of the HP-UX 8.05 release, the
preprocessor is not automatically enabled whenever a
user compiles a FORTRAN program. To invoke the
preprocessor, the +OP option must be specified on the
command line invoking the FORTRAN compiler. For
example,

f77 +OP file.f

will cause the file file.f to be preprocessed and then
compiled by the FORTRAN compiler. In addition, an
integer between 0 and 4 can be appended following +OP.
This integer selects the settings of certain preprocessor
options. For example, to make the preprocessor optimize
as aggressively as possible, the following could be used:

f77 +OP4 file.f

By default, the +OP option also automatically invokes the
standard optimizer at the optimization level defined by
the -O option, which typically indicates full optimization
(-O2).

Advanced Options. The preprocessor can be invoked with
many options by using the -WP option. For example,

      f77 +OP -WP,-novectorize file.f

precludes the preprocessor from generating calls to the
vector library. Some other classes of options include:

• Inlining Options. These options instruct the preprocessor
to replace subroutine or function calls with the actual
text of the subroutine or function. This removes the
overhead of a procedure call and exposes additional
opportunities for optimizations. These options allow the
user not only to instruct the preprocessor whether or not
to inline, but also to provide the maximum level of
subprogram nesting and lists of files to examine for
inlining. The user can exercise manual control over
inlining with directives, and impose restrictions on
inlining in nested loops.

• Optimization Options. Using optimization options, the 
user can adjust parameters that control loop unrolling, 
transformations that may affect arithmetic roundoff, 
and the aggressiveness of the optimizations that are 
attempted. 

• Vectorization Options. These options tell the preprocessor
whether or not to generate calls to the vector library
and adjust the minimum vector length that will cause
such a call to be generated.

• Listing Options. The user can obtain detailed information
about the program and the optimizations performed by
the preprocessor with listing options. Also, the user can
adjust the format and level of detail of the information in
the listings.

• Other Options. Some options specify whether certain 
directives (described below) are to be recognized by the
preprocessor and what global assumptions can be made 
about the behavior of the user program. There are also 
options that allow the user to designate special inline 
comment characters to be recognized and whether to 
save program variables in static memory. 

Directives. The preprocessor provides an extensive set of
directives. These directives can be inserted directly in the
FORTRAN application and appear to the compiler as
comments except when enabled by certain command-line
options. Placement of these directives in the code allows
the user to vary control of the optimizations performed
by the preprocessor in each subprogram. This control can
have the granularity of a single line in a subprogram.

Some of the features provided by directives include: 



• Optimization Control. Optimization directives provide
control of inlining, roundoff, and optimization
aggressiveness.

• Vector Call Control. Vector call translation directives
control substitutions that result in calls to the vector
library from the preprocessor.

• Compatibility. Certain directive formats used by competi- 
tive products are recognized to allow correct optimiza- 
tions to be performed on supercomputer applications. 

• Assertions. Assertions can be inserted in an application 
to allow the user to provide additional program informa-
tion that will allow the preprocessor to make informed
decisions about enabling or disabling certain optimiza- 
tions. For example, many FORTRAN applications violate 
array subscript bounds. If the user does not inform the 
preprocessor of this language standard violation, trans- 
formations may be performed that result in incorrect 
execution of the program. 

Transformations 

The HP FORTRAN optimizing preprocessor supports a
number of different transformations (changes to the
source code) that are intended to improve the
performance of the code. These transformations include the
following categories:

• Scalar transformations 

• Interprocedural transformations 

• Vector transformations 

• Data locality (blocking) and memory access transforma- 
tions. 

Scalar Transformations. Many of these transformations are
"enabling" optimizations. That is, they are necessary to
expose or enable opportunities for the other
optimizations. Some of these transformations include:

• Loop Unrolling. This transformation attempts to
compress together several iterations of a loop, with the
intent of lowering the cost of the loop overhead and
exposing more opportunity for more efficiently using the
functional units of the PA-RISC architecture. The article
on page 89 provides some examples of loop unrolling.

• Loop Rerolling. This transformation is the exact opposite
of loop unrolling in that it is used when a loop has been
explicitly unrolled by the user. The transformation
recognizes that the code has been unrolled, and rerolls it
into a smaller loop. This may be beneficial in cases where
the code can be transformed to a call to the vector library.

• Dead Code Elimination. This transformation removes 
code that cannot be executed. This can improve perfor- 
mance by revealing opportunities for other transforma- 
tions. 

• Forward Substitution. The preprocessor replaces refer- 
ences to variables with the appropriate constants or ex- 
pressions to expose opportunities for additional trans- 
formations. 

• Induction Variable Analysis. The preprocessor recognizes
variables that are incremented by a loop-invariant
amount within a loop, and may replace expressions using
one induction variable with an expression based on
another induction variable. For example, in the following
code fragment the preprocessor identifies that K is an
induction variable:

      DO I = 1, N
         A(I) = B(K)
         K = K + 1
      ENDDO

The code generated by the preprocessor would be:

      DO I = 1, N
         A(I) = B(K + I - 1)
      ENDDO



• Lifetime Analysis. The preprocessor analyzes the use of
variables within a routine, and determines when the value
of a variable can be discarded because it will not be
needed again.
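
The induction variable rewrite above can be checked with a small sketch. This is only an illustration in Python, not the preprocessor's implementation; the function names are hypothetical, K is assumed to advance by 1 on each iteration, and the final update of K after the loop is an assumption about what a later use of K would require.

```python
def copy_with_induction(b, k, n):
    """Original loop: K is bumped by a loop-invariant amount each iteration."""
    a = [None] * n
    for i in range(1, n + 1):          # DO I = 1, N
        a[i - 1] = b[k - 1]            # A(I) = B(K), using 1-based subscripts
        k = k + 1                      # K = K + 1
    return a, k


def copy_substituted(b, k, n):
    """Rewritten loop: the subscript is expressed in terms of I alone."""
    a = [None] * n
    for i in range(1, n + 1):
        a[i - 1] = b[(k + i - 1) - 1]  # A(I) = B(K + I - 1)
    return a, k + n                    # assumed: K is brought up to date once, after the loop
```

With the update of K removed from the loop body, the iterations no longer depend on one another, which is what opens the loop to the vector transformations described later.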

Interprocedural Transformations. The preprocessor is
capable of performing subroutine and function inline
substitution. This optimization allows the preprocessor,
either by explicit user control or heuristically, to replace
a call to a routine with the code in the routine. This
transformation improves performance by:

• Reducing call overhead, which is useful for very small
routines

• Replacing expressions in inlined subroutines or functions
with constants because some arguments to these routines
might be constants

• Exposing other performance improvement opportunities
such as data locality.

Vector Transformations. The preprocessor replaces code
sequences with calls to the vector library where
appropriate. Some classes of these calls include:

• Loop Vectorization. This refers to cases in which the
user's code refers to one or several sequences of inputs,
producing a sequence of outputs. These sequences would
be references to arrays. For example,

      DO 10 I = 1, N
   10 A(I) = B(I) + C(I)

would become:

      CALL vec_$dadd_vector(B(1), C(1), N, A(1))

Not all seemingly appropriate places would be vectorized
because in some cases multiple references to the same
subscripted variable might be more efficiently done by
inline code rather than by a call to a vector library
routine.

• Reduction Recognition. The preprocessor will recognize
some cases in which the results are accumulated for use
as an aggregate, such as in summing all the elements in
an array or finding the maximum value in an array. For
example,

      DO 10 I = 1, N
   10 X = X + A(I) * B(I)

would become:

      X = X + vec_$ddot(A(1), B(1), N)

This transform improves performance in part by knowing
that while a Series 700 computer can add one stream of
numbers in three machine cycles per element, it can also
add two streams of numbers in four machine cycles per
two elements.



There is one problem with this transform. When using
two streams to compute the result (which is what the
routine does) in floating-point calculations, changing the
order in which numbers are added can change the result.
This is called roundoff error. Because of this problem, the
reduction recognition transformation can be inhibited by
using the roundoff switch.
• Linear Recurrence Recognition. This transformation is
used in cases in which the results of a previous iteration
of a loop are used in the current iteration. This is called a
recurrence. For example:

      DO 10 I = 2, N
   10 A(I) = B(I) + C*A(I-1)

In this case the Ith element of A is dependent on the
result of the calculation of the (I-1)th element of A. This
code becomes:

      CALL vec_$rec1c(B(2), N-1, C, A(1))
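
The roundoff concern raised for reduction recognition can be made concrete with a short sketch. The Python below is an illustration only: the function names are hypothetical, and the even/odd split is just one plausible way of forming two accumulation streams, not a description of how the library routine is coded.

```python
def dot_sequential(a, b):
    """Accumulate in program order, as the original DO loop would."""
    x = 0.0
    for ai, bi in zip(a, b):
        x += ai * bi
    return x


def dot_two_streams(a, b):
    """Accumulate even and odd elements separately, then combine the two sums."""
    even = 0.0
    odd = 0.0
    for i in range(0, len(a) - 1, 2):
        even += a[i] * b[i]
        odd += a[i + 1] * b[i + 1]
    if len(a) % 2:                     # leftover element when the length is odd
        even += a[-1] * b[-1]
    return even + odd


a = [1e16, 1.0, -1e16, 1.0]
b = [1.0, 1.0, 1.0, 1.0]
print(dot_sequential(a, b), dot_two_streams(a, b))   # prints 1.0 2.0
```

In program order the first 1.0 is absorbed when it is added to 1e16, so the sequential sum loses it; the two-stream version adds the small values to each other first and keeps both. Neither order is wrong in general, which is why the transformation is left under user control.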

Data Locality and Memory Access Transformations. Memory
side effects such as cache misses can have a significant
impact on the performance of the Series 700 machine. As
a result, a number of transformations have been
developed to reduce the likelihood of cache misses and other
problems.

• Stride-1 Inner Loop Selection. This transformation
examines a nested series of loops, and attempts to
determine if the loops can be rearranged so that a different
loop can run as the inner loop. This is done if the new
inner loop will have more sequential walks of arrays
through memory. This type of access is advantageous
because it reduces cache misses. For example,

      DO 10 I = 1, N
      DO 10 J = 1, N
   10 A(I,J) = B(I,J) + C(I,J)

accesses the arrays A, B, and C. However, it accesses
them in the sequences:

      A(1,1), A(1,2), A(1,3), ...
      B(1,1), B(1,2), B(1,3), ...
      C(1,1), C(1,2), C(1,3), ...

which will result in nonsequential access to memory,
making cache misses far more likely. The following legal
transform will reduce the likely number of cache misses.

      DO 10 J = 1, N
      DO 10 I = 1, N
   10 A(I,J) = B(I,J) + C(I,J)

(The two DO loops have been exchanged.)


• Data Locality Transformations. For situations in which
there is significant reuse of array data and there is
opportunity to restructure, or "block," the code to reduce
cache misses, the preprocessor will create multiple nested
loops that will localize the data in the cache at a cost of
more loop overhead.

• Matrix Multiply Recognition. The preprocessor will recog- 
nize many forms of classic matrix multiply and replace 
them with calls to a highly tuned matrix multiply routine. 
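
The effect of stride-1 inner loop selection can be seen by listing the memory offsets each loop order touches. FORTRAN stores arrays in column-major order, so element (I,J) of an N x N array sits at offset (I-1) + (J-1)*N from the start of the array. The Python sketch below is illustrative only; the function names are hypothetical.

```python
def offsets(n, inner):
    """Element offsets touched for A(I,J) in an n x n column-major array.

    inner="J" models the original nest (J innermost);
    inner="I" models the interchanged nest (I innermost).
    """
    seq = []
    if inner == "J":
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                seq.append((i - 1) + (j - 1) * n)
    else:
        for j in range(1, n + 1):
            for i in range(1, n + 1):
                seq.append((i - 1) + (j - 1) * n)
    return seq


def strides(seq):
    """Distance between consecutive accesses."""
    return [later - earlier for earlier, later in zip(seq, seq[1:])]


print(offsets(3, "J"))   # [0, 3, 6, 1, 4, 7, 2, 5, 8]
print(offsets(3, "I"))   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

With J innermost, consecutive accesses are N elements apart; after the interchange every access is adjacent to the previous one, so each cache line brought in is fully used before the next is needed.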
Example of a Transformation. The following code fragments
taken from the Matrix300 benchmark of the SPEC
benchmark tests show how some of the transformations
described above are incorporated into a FORTRAN program.

      REAL*8 A(300,300), B(301,301), C(302,302)
      DATA M, N, L /300,300,300/
      IA = 300
      IB = 301
      IC = 302
      CALL SGEMM(M, N, L, A, IA, B, IB, C, IC, 0, +1)
      END

      SUBROUTINE SGEMM(M, N, L, A, IA, B, IB, C, IC, JTRPOS, JOB)
      REAL*8 A(IA,N), B(IB,L), C(IC,L)
      JB = ISIGN(IABS(JOB) + 2*(JTRPOS/4), JOB)
      JUMP = JTRPOS + 1
      GO TO (10, 30), JUMP
   10 CONTINUE
      DO 20 J = 1, L
   20 CALL SGEMV(M, N, A, IA, B(1,J), 1, C(1,J), 1, JB)
      RETURN
   30 CONTINUE
      DO 40 J = 1, L
   40 CALL SGEMV(M, N, A, IA, B(1,J), 1, C(J,1), IC, JB)
      RETURN
      END

      SUBROUTINE SGEMV(M, N, A, IA, X, IX, Y, IY, JOB)
      REAL*8 A(IA,N), X(IX,N), Y(IY,N)
      IF (N.LE.0) RETURN
      II = 1
      IJ = IA
      IF (((IABS(JOB)-1)/2).EQ.0) GO TO 210
      IJ = 1
      II = IA
  210 CONTINUE
      IF (MOD(IABS(JOB)-1,2).NE.0) GO TO 230
      DO 220 J = 1, M
  220 Y(1,J) = 0.0D0
  230 CONTINUE
      IF (JOB.LT.0) GO TO 250
      DO 240 J = 1, N
      K = 1 + (J-1)*IJ
  240 CALL SAXPY(M, X(1,J), A(K,1), II, Y(1,1), IY)
      RETURN
  250 CONTINUE
      DO 260 J = 1, N
      L = 1 + (J-1)*IJ
  260 CALL SAXPY(M, -X(1,J), A(L,1), II, Y(1,1), IY)
      RETURN
      END

      SUBROUTINE SAXPY(N, A, X, INCX, Y, INCY)
      REAL X(INCX,N), Y(INCY,N), A
      IF (N.LE.0) RETURN
      DO 310 I = 1, N
  310 Y(1,I) = Y(1,I) + A*X(1,I)
      RETURN
      END

First, routine SGEMM is inlined into the main routine, and
the scalar forward substitution transformation is applied
to propagate arguments.

      REAL*8 A(300,300), B(301,301), C(302,302)
      DATA M, N, L /300,300,300/
      IA = 300
      IB = 301
      IC = 302
      JB = 1
      JUMP = 1
      GO TO (10, 30), 1
   10 CONTINUE
      DO 20 J = 1, L
   20 CALL SGEMV(M, N, A, IA, B(1,J), 1, C(1,J), 1, JB)
      RETURN
   30 CONTINUE
      DO 40 J = 1, L
   40 CALL SGEMV(M, N, A, IA, B(1,J), 1, C(J,1), IC, JB)
      RETURN
      END

Second, dead code elimination is applied. The computed 
GO TO turns into a simple GO TO, and the unreachable code 
is removed. 

      REAL*8 A(300,300), B(301,301), C(302,302)
      DATA M, N, L /300,300,300/
      IA = 300
      IB = 301
      IC = 302
      JB = 1
      JUMP = 1
      DO I = 1, L
         CALL SGEMV(M, N, A, IA, B(1,I), 1, C(1,I), 1, JB)
      END DO
      RETURN
      END

Next, lifetime analysis is applied to the code, and it is
seen that with the current code configuration the
variables L, IB, IC, and JUMP are never modified after the
initial assignment.

      REAL*8 A(300,300), B(301,301), C(302,302)
      DATA M, N /300,300/
      IA = 300
      JB = 1
      DO I = 1, 300
         CALL SGEMV(M, N, A, IA, B(1,I), 1, C(1,I), 1, JB)
      END DO
      END

Notice that the large body of conditional code has been
removed. This is significant as far as the capability to
perform further optimizations is concerned. The reason
that M, N, and IA were not replaced with the value 300 is
that at this point it is not known that the corresponding
arguments to SGEMV are not modified.

Next, the routine SGEMV is inlined, and once again, a
number of transformations are applied:

      REAL*8 A(300,300), B(301,301), C(302,302)
      DATA M /300/
      DO I = 1, 300
         I3 = 1
         DO J = 1, M
            C(J,I) = 0.0D0
         END DO
         DO J = 1, 300
            CALL SAXPY(M, B(J,I), A(1,J), I3, C(1,I), 1)
         END DO
      END DO
      END

Now, we inline SAXPY to get

      REAL*8 A(300,300), B(301,301), C(302,302)
      DO I = 1, 300
         DO J = 1, 300
            C(J,I) = 0.D0
         END DO
         DO J = 1, 300
            DO K = 1, 300
               C(K,I) = C(K,I) + B(J,I) * A(K,J)
            END DO
         END DO
      END DO

Finally, we see that this is a matrix multiply and
transform it into a call to a fast matrix multiply routine:

      CALL BLAS_$DGEMM('N','N',300,300,300,1D0,
     X A(1,1),300,B(1,1),301,0.D0,C(1,1),302)
      END

This set of transformations results in an 11x
performance improvement because of the ability to transform
the original code to a form that can use blocking
efficiently via a coded matrix multiply routine.
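
As a sanity check on the final step, the fully inlined loop nest can be compared with an ordinary matrix product on a small case. The Python below is only a sketch with hypothetical function names; it uses square n x n lists and ignores the differing leading dimensions (300, 301, 302) of the FORTRAN arrays.

```python
def inlined_loops(a, b, n):
    """Mirror of the inlined nest: C(K,I) = C(K,I) + B(J,I) * A(K,J)."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c[j][i] = 0.0                      # C(J,I) = 0.D0
        for j in range(n):
            for k in range(n):
                c[k][i] += b[j][i] * a[k][j]
    return c


def matmul(a, b, n):
    """Textbook product C = A x B, row r of A against column col of B."""
    return [[sum(a[r][m] * b[m][col] for m in range(n)) for col in range(n)]
            for r in range(n)]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(inlined_loops(a, b, 2))   # [[19.0, 22.0], [43.0, 50.0]]
```

Each C(K,I) accumulates A(K,J) * B(J,I) over J, which is the definition of the product of A and B, so replacing the nest with a tuned matrix multiply routine preserves the result up to the order of the floating-point additions.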

Matching the HP FORTRAN Dialect

Although a primary motivation for using the preprocessor 
was the significant performance gains, it was also very 
important for the preprocessor to work as an integrated 
component of the FORTRAN compile path. One key 
aspect to this integration was for the preprocessor to 
recognize and correctly process the dialect extensions 
supported by the HP Series 700 FORTRAN compiler. 

Three dialect areas were addressed: language extensions, 
compiler directives, and command-line options. For each 
of these areas, there were some items that the preproces- 
sor could just ignore, while others required certain actions. 
Another aspect of the dialect issue is that the
transformed FORTRAN code generated by the preprocessor
must conform to HP's FORTRAN dialect.

The first task was to define the list of HP dialect exten- 
sions the preprocessor had to recognize. The initial pass 
at this was done by gathering all known extensions in 
HP FORTRAN including the military extensions
(MIL-STD-1753), VAX FORTRAN 4.0 features, Apollo Domain
DN10000 features, and other HP extensions. This list was
given to the third-party developers as a starting point for 
implementing HP dialect recognition in the preprocessor. 

The next step in defining the dialect extensions was to
push the preprocessor through our extensive FORTRAN
test suites. These suites contain over 8500 tests, ranging
from very simple programs to large FORTRAN
applications. The method we used was to run each positive test
(no expected failure) with the preprocessor, and compare
the results with the expected answers. In this manner, we
were able to collect additional dialect items that needed
to be added to the preprocessor. The final set of dialect
items came as we entered a beta program later in the
release, exposing the preprocessor to sets of customer
codes.

There were a large number of language extensions the 
preprocessor did not originally recognize, but they were 
generally relatively minor features. One example is the ON 
statement, an HP extension that allows specification of 
exception handling. The preprocessor merely had to 
recognize the syntax of this statement and echo it back 
to the transformed file. Another example was allowing 
octal and hexadecimal constants to appear as actual 
arguments to a statement function.

The HP compiler directives also needed to be recognized,
sometimes requiring semantic actions from the
preprocessor. As an example, consider the code segment:

      INTEGER A(10), B(10), C(10), D
      DO I = 1, 10
         A(I) = B(I) * C(I) + D
      ENDDO

The preprocessor will transform this to the following
vector call:

      CALL vec_$imult_add_constant(B(1), C(1), 10, D, A(1))

However, if the $SHORT directive is present in the file, the
preprocessor will instead generate a call to the short
integer version of this vector routine:

      CALL vec_$imult_add_constant16(B(1), C(1), 10, D, A(1))

Most of the directives, such as WARNINGS, are ignored by
the preprocessor.

There were also a number of FORTRAN command-line
options that the preprocessor needed to be aware of. For
example, the -I2 option specifies that short integers will
be the default, which should cause the same effect as the
$SHORT directive in the example above. For each of these
options, the information was passed via preprocessor
command-line options. In the case of the -I2 option, the
FORTRAN driver will invoke the preprocessor with the
-int=2 option.

Another interesting command-line option is +DA1.0, which
indicates that the resulting executable program can be
run on a PA-RISC 1.0 machine. Since the vector library
contains PA-RISC 1.1-specific instructions, the
preprocessor is informed that no vector calls should be generated
in the transformed source by passing it the -novectorize
flag.
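
The two option mappings just described can be summarized in a short sketch. The Python function below is hypothetical and models only what the text states (-I2 becomes -int=2, and +DA1.0 becomes -novectorize); it is not the FORTRAN driver's code, and how the driver treats any other option is not covered here.

```python
def preprocessor_flags(f77_args):
    """Translate the f77 options named in the text into preprocessor options."""
    flags = []
    for arg in f77_args:
        if arg == "-I2":
            flags.append("-int=2")         # short integers become the default
        elif arg == "+DA1.0":
            flags.append("-novectorize")   # the vector library needs PA-RISC 1.1
    return flags


print(preprocessor_flags(["+OP", "-I2", "+DA1.0", "file.f"]))
# ['-int=2', '-novectorize']
```

Passing the information as preprocessor options keeps the preprocessor independent of the driver's own option syntax: the driver decides what each compiler option implies, and the preprocessor sees only settings it already understands.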

In addition to having the preprocessor recognize the HP
Series 700 FORTRAN dialect, there was a need to ensure
that the resulting transformed source from the
preprocessor would be acceptable to the standard FORTRAN
compiler. This situation actually occurred in several
different cases. In one case, the preprocessor generated
FORTRAN source code in which DATA statements appear
amid executable statements, something the compiler
allows only if the -K command-line option is present. The
solution was to have the preprocessor change the order
in which it emits the DATA statements.

Vector Library

The vector library is a collection of routines written in assembly language and
tuned for performance on a PA-RISC 1.1 machine.

At the time the preprocessor was being brought to the HP-UX operating system, the
HP Concurrent FORTRAN compiler team had already been working with the
third party to generate calls to the vector library available in the Domain operating
system. To have a single interface for both products, we decided to provide the
Domain library interface on HP-UX.

The Domain library consists of 57 basic functions, with variations for different
types and variable or unit stride,* for a total of 380 routines. However, not all of
these routines are generated by the preprocessor. Table I lists some of the 39 basic
routines that are generated by the Series 700 preprocessor.





Table I
Some of the Basic Vector Library Routines

Routines                                    Operations

Unary
vec_$abs(U, Count, R)                       R(i) = |U(i)|
vec_$neg(U, Count, R)                       R(i) = -U(i)

Scalar-vector
vec_$add_constant(U, Count, a, R)           R(i) = a + U(i)
vec_$mult_constant(U, Count, a, R)          R(i) = a x U(i)
vec_$sub_constant(U, Count, a, R)           R(i) = a - U(i)

Vector-vector
vec_$add_vector(U, V, Count, R)             R(i) = U(i) + V(i)
vec_$mult_vector(U, V, Count, R)            R(i) = U(i) x V(i)
vec_$sub_vector(U, V, Count, R)             R(i) = U(i) - V(i)

Vector-vector-vector
vec_$add_mult_vector(U, V, W, Count, R)     R(i) = (U(i) + V(i)) x W(i)
vec_$mult_add_vector(U, V, W, Count, R)     R(i) = (U(i) x V(i)) + W(i)
vec_$mult_rsub_vector(U, V, W, Count, R)    R(i) = -(U(i) x V(i)) + W(i)
vec_$mult_sub_vector(U, V, W, Count, R)     R(i) = (U(i) x V(i)) - W(i)

Scalar-vector-vector
vec_$add_mult(U, V, Count, a, R)            R(i) = (a + V(i)) x U(i)
vec_$mult_add(U, V, Count, a, R)            R(i) = (a x V(i)) + U(i)
vec_$mult_sub(U, V, Count, a, R)            R(i) = (a x V(i)) - U(i)

Vector-vector-scalar
vec_$add_mult_constant(U, V, Count, a, R)   R(i) = (U(i) + V(i)) x a
vec_$mult_add_constant(U, V, Count, a, R)   R(i) = (U(i) x V(i)) + a

Summation and dot product
vec_$asum(U, Count)                         result = SUM(|U(i)|)
vec_$sum(U, Count)                          result = SUM(U(i))
vec_$dot(U, V, Count)                       result = SUM(U(i) x V(i))

Linear recurrences
vec_$rec1(U, V, Count, R)                   R(i+1) = U(i) + V(i) x R(i)
vec_$rec1c(U, Count, a, R)                  R(i+1) = U(i) + a x R(i)

Copy and initialization
vec_$copy(U, V, Count)                      V(i) = U(i)
vec_$init(U, Count, a)                      U(i) = a

For most of these basic routines there are eight versions for handling the
variations in type and stride. Table II lists the eight versions for vec_$abs, the routine
that computes an absolute value.

* Stride is the number of array elements that must be skipped over when a subscript's
value is changed by 1.



Table II
Different Versions of vec_$abs

Routine                                          Characteristics

vec_$abs(U, Count, R)                            Single-precision floating-point, unit stride
vec_$dabs(U, Count, R)                           Double-precision floating-point, unit stride
vec_$iabs(U, Count, R)                           32-bit integer, unit stride
vec_$iabs16(U, Count, R)                         16-bit integer, unit stride
vec_$abs_i(U, Stride1, Count, R, Stride2)        Single-precision floating-point, variable stride
vec_$dabs_i(U, Stride1, Count, R, Stride2)       Double-precision floating-point, variable stride
vec_$iabs_i(U, Stride1, Count, R, Stride2)       32-bit integer, variable stride
vec_$iabs16_i(U, Stride1, Count, R, Stride2)     16-bit integer, variable stride
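
The naming pattern suggested by Table II, and by routine names used elsewhere in this article such as vec_$dadd_vector and vec_$imult_add_constant16, can be captured in a few lines. The Python helper below is hypothetical and encodes only a pattern inferred from those names; it is not taken from the library's documentation and may not hold for every routine.

```python
# Assumed prefixes, inferred from the names that appear in the article.
TYPE_PREFIX = {"real4": "", "real8": "d", "int32": "i", "int16": "i"}


def routine_name(base, dtype, unit_stride=True):
    """Build a vector library routine name from a base operation name."""
    name = "vec_$" + TYPE_PREFIX[dtype] + base
    if dtype == "int16":
        name += "16"       # short-integer versions carry a 16 suffix
    if not unit_stride:
        name += "_i"       # variable-stride versions take explicit strides
    return name


print(routine_name("abs", "real8", unit_stride=False))   # vec_$dabs_i
print(routine_name("mult_add_constant", "int16"))        # vec_$imult_add_constant16
```

Four data types and two stride variants give the eight versions per basic routine mentioned in the text.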



Because of time constraints, we could not hand-tune every routine, so we chose to
concentrate on those that would derive the most benefit from tuning. For the rest,
we used FORTRAN versions of the routines. Some of those routines were run
through the preprocessor to unroll the loops and/or run through the software
pipelining optimizer to get the best possible code with minimal effort.

Machine-Specific Features. Two features of PA-RISC 1.1 that hand-tuning was
able to take particular advantage of are the FMPYADD and FMPYSUB combined
operation instructions, and the ability to use double-word loads into 32-bit
floating-point register pairs. In addition, floating-point instruction latencies provide the
greatest opportunities for scheduling.

Because of these factors, we felt that floating-point routines would benefit more
from hand-tuning than integer routines. In particular, 32-bit floating-point routines
can exploit the double-word load and store feature, which is currently beyond the
capabilities of the optimizer.

For some of the most critical routines, we used a nonarchitected instruction
available in this particular implementation of PA-RISC 1.1 to do quad-word stores. This
instruction requires longer store interlocks so it isn't always worthwhile to use it,
but it was able to improve some routines by about 10%.

Double-Word Loads and Stores. To use double-word loads and stores for
single-precision vectors, care must be taken to ensure that the addresses are properly
aligned. PA-RISC 1.1 enforces strict alignment requirements on data accesses.
Thus, a single-word load must reference a word-aligned address, and a
double-word load must reference a double-word-aligned address. For example, take two
single-precision vectors:

      REAL*4 A(4), B(4)

The elements of arrays A and B might be laid out in memory as shown in Fig. 1.

      Address     Element   Word      Double word
      40000000    A(1)      Word 1    Double Word
      40000004    A(2)      Word 2
      40000008    A(3)      Word 3    Double Word
      4000000C    A(4)      Word 4
      40000010    B(1)      Word 5    Double Word
      40000014    B(2)      Word 6
      40000018    B(3)      Word 7    Double Word
      4000001C    B(4)      Word 8

Fig. 1. The arrangement of vectors A and B in memory. All the elements are
double-word-aligned.

Suppose we want to copy vector A to vector B. If we use single-word loads and
stores, each element will be accessed by a separate load or store. There is no
problem with alignment because each element is aligned on a word boundary. This
method requires four loads and four stores.

Since vectors A and B are both double-word-aligned (the starting address is a
multiple of eight), we can use double-word loads and stores, cutting the number
of memory accesses in half. The first load will load A(1) and A(2) into a double
floating-point register. The first store will store that register into B(1) and B(2).
This method requires only two loads and two stores.

In Fig. 2 we have a case in which the starting addresses of the vectors are not
double-word-aligned. In this case only elements 2 and 3 can be copied using
double-word loads and stores. The first and last elements must use single-word
accesses because of alignment restrictions.

      Address     Element   Word      Double word
      40000004    A(1)
      40000008    A(2)      Word 2    Double Word
      4000000C    A(3)      Word 3
      40000010    A(4)
      40000014    B(1)
      40000018    B(2)      Word 6    Double Word
      4000001C    B(3)      Word 7
      40000020    B(4)

Fig. 2. The arrangement of vectors A and B in memory when some of the elements are
not double-word-aligned.
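
The choice between single-word and double-word accesses in the two figures can be sketched as a simple planning loop. The Python below is illustrative only: the function name is hypothetical, and it makes the simplifying assumption that a double-word access is used only when both the source and the destination are 8-byte aligned at the current element.

```python
def copy_plan(src_addr, dst_addr, count, word=4):
    """Plan a copy of `count` single-precision elements.

    Returns (kind, first_element) steps, where kind is "single" or "double"
    and first_element is the 1-based index of the first element moved.
    """
    plan = []
    i = 0
    while i < count:
        src = src_addr + i * word
        dst = dst_addr + i * word
        if i + 1 < count and src % 8 == 0 and dst % 8 == 0:
            plan.append(("double", i + 1))   # moves elements i+1 and i+2
            i += 2
        else:
            plan.append(("single", i + 1))
            i += 1
    return plan


print(copy_plan(0x40000000, 0x40000010, 4))
# [('double', 1), ('double', 3)]
print(copy_plan(0x40000004, 0x40000014, 4))
# [('single', 1), ('double', 2), ('single', 4)]
```

The first call corresponds to Fig. 1 (two loads and two stores) and the second to Fig. 2, where only elements 2 and 3 can travel as a pair. The hand-tuned routines select a code path per alignment combination rather than testing alignment on every element; the loop form here is only meant to show which accesses are legal.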

Special code is required to handle all the possible alignment combinations for the
different vectors in a library routine. For example, there are 16 different possible
alignment combinations for vec_$mult_sub_vector.

We reduced the amount of code needed to handle all these combinations by
performing a single iteration for some combinations, then jumping to the code for the
opposite combination. For example, if vectors 2 and 4 are double-word aligned and
vectors 1 and 3 are not, we can perform the operation on one element, which
effectively inverts the alignment combination. Vectors 1 and 3 will now be
double-word aligned, and vectors 2 and 4 will not. We can then go to the code for the
vector combination unaligned-aligned-unaligned-aligned, which takes advantage
of double-word load and store instructions for that particular alignment combination.

We also took advantage of commutative operations to reduce the amount of code
we had to write. Again, for vec_$mult_sub_vector, the multiplication is
commutative, so we can swap the pointers to vectors 1 and 2, then jump to the code for the
commuted combination.

Using these techniques, we reduced the number of different combinations that had
to be tuned for the routine vec_$mult_sub_vector from 16 to six.
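
One way to account for the reduction from 16 to six is to count the alignment combinations that remain distinct once the two tricks are allowed. The Python sketch below is an illustration under stated assumptions, with hypothetical names: the four vectors are taken to be U, V, W, and R in R(i) = (U(i) x V(i)) - W(i), all single-precision with unit stride, so that processing one element flips the double-word alignment of every vector.

```python
from itertools import product


def canonical(combo):
    """Pick one representative for all combinations reachable from `combo`.

    combo is (U, V, W, R), where True means double-word aligned.
    """
    u, v, w, r = combo
    swapped = (v, u, w, r)                     # multiplication commutes: swap U and V
    variants = [combo, swapped]
    # One single-element iteration inverts the alignment of every vector.
    variants += [tuple(not flag for flag in c) for c in (combo, swapped)]
    return min(variants)


def distinct_cases():
    """Alignment combinations that still need their own tuned code."""
    return {canonical(c) for c in product((False, True), repeat=4)}


print(len(distinct_cases()))   # 6
```

Of the 16 combinations, the eight in which U and V have the same alignment collapse to four under inversion, and the eight in which they differ collapse to two under inversion and swapping together, giving six.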



Instruction Scheduling 

The instruction scheduling for the vector library is tuned for this particular
implementation of PA-RISC 1.1. The characteristics of other implementations could very
well be different.

There are requirements for minimum distance between certain types of
instructions to avoid interlocks. To make the most efficient use of the processor, the
instruction sequences for independent operations can be interleaved. This is
known as software pipelining, which is discussed in the article on page 39.



Another aspect of this issue is that the transformed
FORTRAN source code is often structured differently
from what a human programmer would write, exposing
the FORTRAN compiler to code it had not seen before.
The result is that we uncovered (and fixed) several minor
compiler defects both in the FORTRAN compiler front
end and the optimizer.



In Pursuit of HP Quality 

Early in the HP-UX 8.05 release cycle, as potential
performance benefits were identified, a commitment was made
to use the preprocessor and to deliver specific
performance on key benchmarks and applications. The
subsequent development effort involved a geographically
distributed HP team working together with the third
party, all on a very tight schedule. In this situation, close
attention to the quality assurance process was required.
Three general areas of quality were addressed:

• Performance testing for both industry benchmarks and 
general applications 

• Correctness of preprocessor source transformations 

• Preprocessor acceptance of the HP FORTRAN dialect 

To address these quality issues, the following steps were 
taken: 

• Identification of a test space to use for testing purposes 

• Initiation of a beta test program 

• Choice of a method for tracking and prioritizing outstand- 
ing problems 

• Development of a regular testing and feedback cycle.

Identifying the Test Space. For performance related testing,
standard benchmarks such as the SPEC 4 benchmark
programs and Linpack were used. Since we had
committed to specific performance numbers with these
benchmarks, it was crucial to monitor their progress. As
the release progressed, performance related issues also
came to our attention through runs of an internally
maintained application test suite as well as from HP
factory support personnel and from a beta program.

While some of the performance tests did help test the 
correctness and dialect issues of quality, we wanted to 
identify a set of programs or program fragments specifi- 
cally for these purposes. White box testing was provided 
by the third party. For HP's testing process, we viewed 
the preprocessor as a black box, concentrating on its
functionality to the FORTRAN user. To this end, we chose
to concentrate on the same test bed that we use for 
quality assurance on the Series 700 FORTRAN compiler. 
In addition, to get further exposure to typical FORTRAN 
programs, we also developed a beta program. 

This choice of a testing space did not test the complete 
functionality of the preprocessor. For example, procedure 
inlining was performed when the preprocessor was run
on our test suites, but for this release we did not develop
a set of tests specifically to test the inlining capabilities.

Another issue in choosing the test space was to identify 
the command-line option combinations to test. In the case
of the preprocessor, over 30 individual options are sup- 
ported, and when the different option combinations were 
considered, the complete set of option configurations 
became unreasonably large to test fully under our tight 
development schedule. 

To handle the option situation, we concentrated on
configurations most likely to be used. In most situations,
we anticipated that the preprocessor would be invoked
through the FORTRAN compiler driver by using one of
the +OP options. These five options (+OP0 ... +OP4) were
designed to be most useful for general use, and they
exercise many of the main preprocessor features. For this
reason, we restricted the majority of our test runs to
using the +OP options.

Although not initially considered for testing purposes, the
examples given in the FORTRAN manual also turned out
to be important for tracking the quality of the
preprocessor. Many tests were written for the manual to help
explain each of the new features provided by the
preprocessor. Running these tests during the release uncovered
a few regressions (defects) in the preprocessor. These
regressions were fixed, adding to the overall quality of
the product.

Beta Program. An important source of quality issues was 
the beta program initiated specifically to gain additional
exposure for the preprocessor. Since this was a new 
component of the FORTRAN compile path, it was espe- 
cially important to expose the preprocessor to existing 
FORTRAN applications. 

The results of the beta program were quite successful. 
Since some of the sites involved applications heavily 
reliant on the HP FORTRAN dialect, we uncovered a 
number of preprocessor problems concerning dialect
acceptance. Performance issues were also raised from the
beta program. In some cases significant performance 
gains were reported; in others, less success was achieved. 

As problems reported were fixed, updated versions of the 
preprocessor were provided to the beta sites. In this 
manner, the beta program provided another source of 
improvement and regression tracking. 

Problem Tracking. With the tight schedule and multisite
team, it was important to have a mechanism for tracking
problems that arose with the preprocessor. The purpose
was to make sure that all problems were properly recorded,
to have a common repository for problems, and
to have a basis for prioritizing the outstanding problems.

Although many people could submit problems, a single
team member was responsible for monitoring a list that 
described reported preprocessor problems and closing out 
problems when they were resolved. As part of this 
process, a test suite was developed that contained exam- 
ple code from each of the submitted problems. This test 
suite provided us with a quick method of checking 
recurrence of old problems as development progressed. 

Since this list was used to prioritize preprocessor prob- 
lems, the team developed a common set of guidelines for 
assigning priority levels to each submitted problem. For 
example, any item that caused a significant performance 
problem (e.g., slowdown on a key benchmark) would be 
assigned a very high priority, while problems with an 
infrequently used feature in HP FORTRAN dialect processing
were given a lower priority.

During team conferences, the problem list was a regular 
agenda item. The team would review all outstanding
problems, adjusting priority levels as considered appropriate.
In this manner, we had an ongoing list representing
the problems that needed to be fixed in the preprocessor.



Testing and the Feedback Cycle. As part of any quality 
process, it is important to develop a regular set of 
activities to monitor improvements and regressions in the 
product. As the preprocessor release entered its later 
stages, we developed a regular weekly cycle that coincided
with the program-wide integration cycle. The
activities we performed during each week of this period 
included: 

• Review of the list of outstanding problems, identifying the 
next items to be addressed by the third party. 

• Weekly phone conference with the third party. These 
meetings provided close tracking of the problems fixed 
the previous week as well as a discussion of any new 
problems to be fixed the following week. 

• A regression test of the latest version of the preprocessor. 
Each week we received a new version of the preprocessor
containing the latest fixes. The testing involved running
our test suites and checking for any regressions.

• The resolution of any fixed problems and updating the 
outstanding problem list. 

• A decision about allowing the latest version of the pre- 
processor to be submitted to system integration. Based 
on the results of test runs, the new preprocessor would 
be submitted if it was superior to the previous version.

The fast, regular feedback of this process towards the
end of the product release cycle maximized the quality of 
the product within very tight delivery constraints. 

Performance Analysis 

The FORTRAN optimizing preprocessor has had a significant
impact on the performance of FORTRAN applications.
While the performance improvement seems to vary
significantly based on the specifics of the code, we have
seen more than a 10x speedup in some programs because
of improvement in data locality, which significantly
reduces cache miss rates. Array manipulation also tends
to show improvement.

The Livermore Loops are a collection of kernel loops
collected by the staff at Lawrence Livermore National
Laboratory, and are
frequently used as benchmarks for scientific calculations.
Table I shows the performance results for these loops
executing after being compiled with the optimizing 
preprocessor. 

The improvements in loops 3, 6, 11, 12, and 18 were
because of vectorization. Loop 13 benefited from loop
splitting, while loop 14 benefited from loop merging. Loop
16 gained from transforming "spaghetti code" to structured
code. Loop 21 gained significantly from recognition
of a matrix multiply and a call to a tuned and blocked 
matrix multiply routine. Note that because of either the 
options selected or the heuristics of the optimizer, loops 
5 and 8 degraded in performance. 

Matrix300 is a well-known benchmark in the SPEC benchmark
suite. The code in this routine performs eight
matrix multiplies on two 300-by-300 matrices. In this case,
the application of blocking to the matrix multiply algorithm
had a significant impact on the performance of the
benchmark. Table II compares the results of running the
Matrix300 benchmark with and without the optimizing
preprocessor.

Note the significant reduction in cache miss rate because 
of the blocking techniques. This technique is applicable to 
a number of multidimensional matrix algorithms beyond 
matrix multiply. 
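
The effect of blocking can be sketched in a few lines. The following Python sketch is not the preprocessor's output or the tuned matrix multiply routine mentioned above; the function names and the block size of 4 are assumptions chosen only for illustration. It shows that a blocked (tiled) loop nest computes the same product as the plain triple loop while working on one small tile of each matrix at a time.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=4):
    """Same product, computed tile by tile so that each tile of a, b, and c
    is reused while it is still likely to be resident in the cache."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Multiply one block-by-block tile of a by one tile of b.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

The arithmetic is unchanged; only the order in which elements are touched differs, which is why any benefit shows up in the cache miss count rather than in the operation counts.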

Besides benchmarks, we have seen some significant
performance improvements in other applications when
the preprocessor is used. Although we had one case in
which there was a 211% improvement, most FORTRAN
programs have exhibited performance improvements in
the 15% to 20% range. Also, as in the Livermore Loops
benchmarks, we have found a few cases in which there
was either no improvement in performance, or a degradation
in performance. We continue to investigate these
cases.

Table II
Performance of Matrix300 Benchmark

                                        Without         With
                                        Preprocessor*   Preprocessor*

Number of floating-point multiplies     54              19
Number of floating-point adds           [illegible]     [illegible]
Number of FMPYADDs                      162             197
Number of floating-point loads          432             [illegible]
Number of floating-point stores         217             15
Number of miscellaneous instructions    576             33
Number of cache misses                  64 (7.29%)      2.8 ([illegible]%)
HP 9000 Model 720 SPECmark              27.9            330.0

* Millions of operations







Acknowledgments
cessor, especially Bob Montgomery and Felix Krasovec.
years in FORTRAN preprocessor technology. HP is
pleased to be able to provide an industry-leading FORTRAN
product through the combination of Kuck and
Associates' preprocessor and HP's FORTRAN compiler
technology.

HP-UX is based on and is compatible with UNIX System Laboratories' UNIX* operating system.
It also complies with X/Open's* XPG3, POSIX 1003.1, and SVID2 interface specifications.
UNIX is a registered trademark of UNIX System Laboratories Inc. in the U.S.A. and other
countries.
X/Open is a trademark of X/Open Company Limited in the UK and other countries.



Loops 2, 9, 10, 19, and 20 yielded 0.0%
Harmonic Mean = 1" 7" 
Median Rate = 20.6% 
Average Rate = 12.7% 
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Register Reassociation in PA-RISC 
Compilers 

Optimization techniques added to PA-RISC compilers result in the use of 
fewer machine instructions to handle program loops. 

by Vatsa Santhanam



Register reassociation is a code-improving transformation
that is applicable to program loops. The basic idea is to
rearrange expressions found within loops to increase
optimization opportunities, while preserving the results
computed. In particular, register reassociation can expose
loop-invariant partial expressions in which intermediate
results can be computed outside the loop body and
reused within the loop. For instance, suppose the following
expression is computed within a loop:

    (loop_variant + loop_constant_1) + loop_constant_2

where loop_variant is a loop-varying quantity, and
loop_constant_1 and loop_constant_2 are loop-invariant quantities
(e.g., literal constants or variables with no definitions
within a loop). In the form given above, the entire expression
has to be computed within the loop. However, if this
expression is reassociated as:

    (loop_constant_1 + loop_constant_2) + loop_variant

the sum of the two loop-invariant terms can be computed
outside the loop and added to the loop-varying quantity
within the loop. This transformation effectively eliminates
an add operation from the body of the loop. Given that
the execution time of applications can be dominated by
code executed within loops, reassociating expressions in
the manner illustrated above can have a very favorable
performance impact.
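
The saving can be illustrated with a small Python sketch. This is only a model of the idea, not compiler code; the function names and the explicit add counters are assumptions introduced for illustration, and integer operands are assumed, as in the address arithmetic this article is concerned with.

```python
def sum_original(values, c1, c2):
    """Evaluate (v + c1) + c2 for every v: two adds per iteration."""
    adds = 0
    out = []
    for v in values:
        out.append((v + c1) + c2)
        adds += 2
    return out, adds


def sum_reassociated(values, c1, c2):
    """Evaluate (c1 + c2) + v: the invariant sum is hoisted out of the loop."""
    c12 = c1 + c2          # computed once, outside the loop
    adds = 1
    out = []
    for v in values:
        out.append(c12 + v)
        adds += 1
    return out, adds
```

For a loop of n iterations the add count in this model drops from 2n to n + 1, while the computed values are identical.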

The term "register reassociation" is used to describe this 
type of optimization because the expressions that are
transformed typically involve integer values maintained in
registers. The transformation exploits not just the associative
laws of arithmetic but also the distributive and
commutative laws. Register reassociation has also been
described in the literature as subscript commutation.1

Opportunities to apply register reassociation occur frequently
in code that computes the effective address of
multidimensional array elements that are accessed within
loops. For example, consider the following FORTRAN
code fragment, which initializes a three-dimensional array:

      DO 100 i = 1, DIM1
        DO 100 j = 1, DIM2
          DO 100 k = 1, DIM3
  100       A(i,j,k) = 0.0

Arrays in FORTRAN are stored in column-major order,
and by default, indexes for each array dimension start at
one. Fig. 1 illustrates how array A would be stored in
memory.

Fig. 1. Column-major storage layout for array A (shown in order of increasing memory addresses).

Given such a storage layout, the address of the
array element A(i,j,k) is given by:

    ADDR(A(1,1,1)) + (k-1) x DIM2 x DIM1 x element_size +
                     (j-1) x DIM1 x element_size +
                     (i-1) x element_size

where ADDR(A(1,1,1)) is the base address of the first element
of array A, DIMn is the size of the nth dimension,
and element_size is the size of each array element. Since
the individual array dimensions are often simple integer
constants, a compiler might generate code to evaluate the
above expression as follows:

    [((((k x DIM2) + j) x DIM1) + i) - ((1 + DIM2) x DIM1 + 1)] x
        element_size + ADDR(A(1,1,1))                           (1)

Since the variable k assumes a different value for each
iteration of the innermost loop in the above example, the
entire expression is loop-variant.

With suitable reassociation, the address computation can
be expressed as:

    ADDR(A(i,j,k)) = α x k + β                                  (2)

where α and β are loop-invariant values that can be
computed outside the loop, effectively eliminating some
code from the innermost loop. From expression 1:

    α = DIM1 x DIM2 x element_size

and

    β = [(j x DIM1 + i) - ((1 + DIM2) x DIM1 + 1)] x
        element_size + ADDR(A(1,1,1)).
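
As a check on the algebra, the following Python sketch evaluates the column-major address formula term by term and in the reassociated form of expression 2. The dimensions and element size are those used in the article's example further on (REAL*4 A(10,20,30)); the base address BASE and the function names are hypothetical values chosen for illustration.

```python
ELEMENT_SIZE = 4            # REAL*4 elements
DIM1, DIM2, DIM3 = 10, 20, 30
BASE = 0x1000               # hypothetical ADDR(A(1,1,1))


def addr_direct(i, j, k):
    """Column-major address of A(i,j,k), written term by term."""
    return (BASE
            + (k - 1) * DIM2 * DIM1 * ELEMENT_SIZE
            + (j - 1) * DIM1 * ELEMENT_SIZE
            + (i - 1) * ELEMENT_SIZE)


def alpha_beta(i, j):
    """Coefficients of expression 2; both are invariant in the k-loop."""
    alpha = DIM1 * DIM2 * ELEMENT_SIZE
    beta = ((j * DIM1 + i) - ((1 + DIM2) * DIM1 + 1)) * ELEMENT_SIZE + BASE
    return alpha, beta


def addr_reassociated(i, j, k):
    """Address of A(i,j,k) in the form alpha * k + beta."""
    alpha, beta = alpha_beta(i, j)
    return alpha * k + beta
```

With these dimensions α is 800 bytes, and the two functions agree for every valid (i, j, k).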



The simplified expression (α x k + β) evaluates a linear
arithmetic progression through each iteration of the
innermost loop. This exposes an opportunity for another
closely related loop optimization known as strength
reduction.2,3 The basic idea behind this optimization is to
maintain a separate temporary variable that tracks the
values of the arithmetic progression. By incrementing the
temporary variable appropriately in each iteration, the
multiplication operation can be eliminated from the loop.
For our simple example, the temporary variable would be
initialized to α + β outside the innermost loop and incremented
by α each time through the loop. This concept
is illustrated in Fig. 2.
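
A minimal Python sketch of strength reduction on such a progression follows. The function names are assumptions made for illustration; the point is only that a running add reproduces the values of the multiply.

```python
def addresses_with_multiply(alpha, beta, n):
    """Compute alpha*k + beta for k = 1..n with one multiply per iteration."""
    return [alpha * k + beta for k in range(1, n + 1)]


def addresses_strength_reduced(alpha, beta, n):
    """Same progression, with the multiply replaced by a running add."""
    out = []
    t = alpha + beta            # value of alpha*k + beta at k = 1
    for _ in range(n):
        out.append(t)
        t += alpha              # advance to the value for the next k
    return out
```

The temporary t plays the role of the variable that strides through the array in Fig. 2.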

Note that this can be particularly beneficial for an architecture
such as PA-RISC in which integer multiplication is
usually translated into a sequence of one or more instructions
possibly involving a millicode library call.†

On some architectures, such as PA-RISC and the IBM 
RISC System/6000, register reassociation and strength
reduction can be taken one step further. In particular, if
the target architecture has an autoincrement addressing
mode, incrementing the temporary variable that maintains
the arithmetic progression can be accomplished automatically
as a side effect of a memory reference. Through this
additional transformation, array references in loops can
essentially be converted into equivalent, but cheaper,
autoincrementing pointer dereferences.

An Example 

To clarify the concepts discussed so far, let us compare
the PA-RISC assembly code for the above example with
and without register reassociation. Assume that the
source code fragment for the example is contained in a
subroutine in which the array A is a formal parameter
declared as:

      REAL*4 A(10,20,30)

The loop limits DIM1, DIM2, and DIM3 take on the constant
values 10, 20, and 30 respectively. The following assembly
code was produced by the HP 9000 Series 800 HP-UX
8.0 FORTRAN compiler at level 2 optimization with loop
unrolling and reassociation completely disabled.††

 1:      LDI        1,%r31                   ; i <- 1
 2:      FCPY,SGL   %fr0L,%fr4L              ; fr4 <- 0.0
 3:      LDI        20,%r24                  ; r24 <- 20
 4:      LDI        600,%r29                 ; r29 <- 600
 5: i_loop_start
 6:      LDI        1,%r23                   ; j <- 1
 7: j_loop_start
 8:      LDI        20,%r25                  ; t<k*20> <- 20
 9: k_loop_start
10:      ADD        %r25,%r23,%r19           ; r19 <- t<k*20> + j
11:      SH2ADD     %r19,%r19,%r20           ; r20 <- r19*5
12:      SH1ADD     %r20,%r31,%r21           ; r21 <- r20*2 + i
13:      LDO        -211(%r21),%r22          ; r22 <- r21 - 211
14:      LDO        20(%r25),%r25            ; t<k*20> <- t<k*20> + 20
15:      COMB,<=    %r25,%r29,k_loop_start   ; if t<k*20> <= r29 go to k_loop_start
16:      FSTWX,S    %fr4L,%r22(0,%r26)       ; *[r22*4 + ADDR(A(1,1,1))] <- fr4
17:      LDO        1(%r23),%r23             ; j <- j + 1
18:      COMB,<=,N  %r23,%r24,k_loop_start   ; if j <= r24 go to k_loop_start
19:      LDI        20,%r25
20:      LDO        1(%r31),%r31             ; i <- i + 1
21:      COMIBF,<,N 10,%r31,j_loop_start     ; if i <= 10 go to j_loop_start
22:      LDI        1,%r23
23:      BV,N       %r0(%r2)                 ; return

Fig. 2. An illustration of using a temporary variable P to track the
expression αk + β to stride through array A. (For the loop
DO 100 k = 1, DIM3, the figure shows P at k = 1, k = 2, k = n where
2 < n < DIM3, and k = DIM3; at k = n, P has advanced by (n-1) x α.)



† Removing multiplications from loops is beneficial even on version 1.1 of PA-RISC, which
defines a fixed-point multiply instruction that operates on the floating-point register file.
To exploit this instruction, it may be necessary to transfer data values from general registers
to floating-point registers and then back again.



†† The Series 800 HP-UX 8.0 FORTRAN compiler does not include the code generation and
optimization enhancements implemented for the Series 700 FORTRAN compiler. Using
the code produced by the Series 800 FORTRAN compiler instead of the Series 700
FORTRAN compiler helps better isolate and highlight the full impact of register
reassociation.

The labels i_loop_start, j_loop_start, and k_loop_start that mark
the start of a loop body and the annotations are not
generated by the compiler but were added for clarity.

The following sections describe the optimizations performed
in the above code segment.

Constant Folding. Given that DIM2 = 20 and DIM1 = 10, the
partial expression -((1 + DIM2) x DIM1 + 1) has been evaluated
to be -211 at compile time.

Loop Invariants. Loop-invariant code motion has positioned
instructions that compute loop-invariant values as far out
of the nest of loops as possible. For instance, the definition
of the floating-point value 0.0 has been moved from
the innermost loop to outside the body of all three loops
where it will be executed exactly once (line 2). (On
PA-RISC systems, when floating-point register 0 is used as
a source operand in an instruction other than a floating-point
store, its value is defined to be 0.0.)

Index Shifting. The innermost k-loop (lines 10 to 16)
contains code that computes the partial expression ((((k x
DIM2) + j) x DIM1) + i) whose value is added to -211 and
stored in register 22 (line 13) before being used in the
instruction that stores 0.0 in the array element. Register
22 is used as the index register; it is scaled by four (to
achieve the multiplication by the element size) and then
added to base register 26, which contains ADDR(A(1,1,1)).
These operations produce the effective address for the
store instruction.

The compiler has strength-reduced the multiplication of k 
by DIM2 in line 14. A temporary variable that tracks the
value of k x 20 (referred to as t<k*20> in the annotations)
has been assigned to register 25 in line 8. This temporary
variable is used in the calculation of the address of A(i,j,k).
By incrementing the temporary variable by 20 on each
iteration of the innermost loop, the multiplication of k by
20 is rendered useless, and therefore removed.

Linear Function Test Replacement. After strength-reducing k
x 20, the only other real use of the variable k is to
check for the loop termination condition (line 15).
Through an optimization known as linear function test
replacement, the use of the variable k in the innermost
loop termination check is replaced by a comparison of
the temporary variable t<k*20> against 600, which is the
value of DIM3 scaled by a factor of 20. This optimization
makes the variable k superfluous, thus enabling the
compiler to eliminate the instructions that initialize and
increment its value.
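
The combined effect of strength reduction and linear function test replacement on the innermost loop can be modeled in Python. This is a sketch of the index arithmetic in lines 10 to 16 only; the function names and the base parameter are assumptions made for illustration, and the constants 20, 10, 211, 4, and 600 are those of the example.

```python
def inner_loop_original(i, j, base=0):
    """Addresses stored to by the k-loop, written with k and a multiply."""
    out = []
    k = 1
    while k <= 30:
        index = ((k * 20 + j) * 10 + i) - 211
        out.append(index * 4 + base)
        k += 1
    return out


def inner_loop_lftr(i, j, base=0):
    """Same loop after strength reduction and linear function test
    replacement: t tracks k*20 and also controls loop termination."""
    out = []
    t = 20                       # t<k*20> for k = 1
    while t <= 600:              # replaces the test k <= 30
        index = ((t + j) * 10 + i) - 211
        out.append(index * 4 + base)
        t += 20                  # replaces k <- k + 1
    return out
```

Once t carries both the address calculation and the exit test, nothing in the loop refers to k any longer.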

Branch Scheduling. A final point to note is that the loop
termination checks for the i-loop and j-loop are performed
using the nullifying backward conditional branch
instructions in lines 18 and 21. In PA-RISC, the semantics
of backward nullifying conditional branches are that the
delay slot instruction (the one immediately following the
branch instruction) is executed only if the branch is
taken and suppressed otherwise. This nullification feature
allows the compiler to schedule the original target of a
backward conditional branch into its delay slot and
redirect the branch to the instruction following the
original target.



In contrast to the i-loop and j-loop, the innermost loop 
termination check is a non-nullifying backward condition- 
al branch whose delay slot instruction is always executed, 
regardless of whether the branch is taken.

Applying Reassociation

The most important point to note about the assembly
code given above is that the innermost loop, where much
of the execution time will be spent, consists of the seven
instructions in lines 10 to 16. By applying register reassociation
to the innermost loop, and making use of the
base-register modification feature available on certain
PA-RISC load and store instructions, the innermost loop
can be reduced to three instructions.†

The following code fragment shows the three instructions
for the innermost loop. Registers r19 to r22 and fr4 are
assumed to be initialized with values indicated by the
annotations.

; Initialize general registers r19 through r22 and floating-point
; register fr4
;   fr4 <- 0.0
;   r19 <- 1
;   r20 <- 30
;   r21 <- ADDR(A(i,j,1))
;   r22 <- 800
; The three innermost loop instructions
k_loop_start
        LDO       1(%r19),%r19              ; k <- k + 1
        COMB,<=   %r19,%r20,k_loop_start    ; if k <= r20 go to k_loop_start
        FSTWX,M   %fr4L,%r22(0,%r21)        ; *[r21] <- fr4, r21 <- r21 + r22

This assembly code strides through the elements of
array A with a compiler-generated temporary pointer
variable that is maintained in register 21. This register
pointer variable is initialized to the address of A(i,j,1)
before entry to the innermost loop and postincremented
by 800 bytes after the value 0.0 is stored in A(i,j,k).

This code also reflects the real intent of the initialization
loop, which is to initialize the array A by striding through
the elements in row-major order. It does this in fewer
instructions by exploiting PA-RISC's base-register modification
feature. The equivalent semantics for the above
inner-loop code, expressed in C syntax with p++ denoting
the 800-byte postincrement, is:

    for (p = &a(i,j,1), k = 1; k <= 30; k++)
    {
        *p++ = 0.0;
    }



The code sequence for the k-loop in the assembly code
fragment can be improved even further. Note that as a
result of register reassociation, the loop variable k, which
is maintained in register 19, is now only used to control
the iteration count of the loop. Using the linear function
test replacement optimization mentioned earlier, the loop
variable can be eliminated. Specifically, the loop termination
check in which the variable k in register 19 is
compared against the value 30 in register 20 can be
replaced by an equivalent comparison of the compiler-generated
temporary pointer variable (register 21) against
the address of the array element A(i,j,30). This can reduce
the innermost loop to just the following two instructions.

        FSTWX,M   %fr4L,%r22(0,%r21)        ; *[r21] <- fr4, r21 <- r21 + r22
k_loop_start                                ; The two innermost loop instructions
        COMB,<=,N %r21,%r20,k_loop_start    ; if r21 <= ADDR(A(i,j,30)) go to
                                            ; k_loop_start
        FSTWX,M   %fr4L,%r22(0,%r21)        ; *[r21] <- fr4, r21 <- r21 + r22

Register r20 would have to be initialized to the address of
A(i,j,30).

† Base-register modification of loads and stores effectively provides the autoincrement
addressing mode described earlier.

June 1992 Hewlett-Packard Journal 35

© Copr. 1949-1998 Hewlett-Packard Co.

On PA-RISC machines, if a loop variable is only needed 
to control the loop iteration count, the loop variable can 
often be eliminated using a simpler technique. Specifically, 
the PA-RISC instruction set includes the ADDB (add and
branch) and ADDIB (add immediate and branch) conditional
branch instructions. These instructions first add a
register or an immediate value to another register value
and then compare the result against zero to determine the
branch condition.

If a loop variable is incremented by the same amount on 
each iteration of the loop and if it is needed solely to 
check the loop termination condition, the instructions that 
increment the loop variable can be eliminated from the 
body of the loop by replacing the loop termination check 
with an ADDB or ADDIB instruction using an appropriate
increment value and a suitably initialized general-purpose 
count register. 

For our small example, the innermost loop can be trans- 
formed into a two-instruction countdown loop using the 
ADDIB instruction as shown in the following code. 

        LDI       -29,%r19                  ; initialize count register
k_loop_start
        ADDIB,<=  1,%r19,k_loop_start       ; r19 <- r19 + 1, if r19 <= 0 go to
                                            ; k_loop_start
        FSTWX,M   %fr4L,%r22(0,%r21)        ; *[r21] <- fr4, r21 <- r21 + r22

The j-loop and i-loop of our example can be similarly 
transformed. The increment and branch facility is not 
unique to PA-RISC, but unlike some other architectures, 
the general-purpose count register is not a dedicated
register, allowing multiple loops in a nest of loops to be 
transformed conveniently. 
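
The choice of -29 as the initial count for a 30-iteration loop can be checked with a small Python model. This is a sketch of the control flow only, under the semantics described above (add first, then test against zero, with the delay-slot store executed whether or not the branch is taken); the function name is hypothetical.

```python
def addib_countdown(trip_count):
    """Model a loop closed by ADDIB,<= 1,count,loop with the store in the
    (always executed) delay slot. Returns the number of stores performed."""
    count = -(trip_count - 1)    # e.g. -29 for 30 iterations
    stores = 0
    while True:
        count += 1               # ADDIB adds the immediate first...
        taken = count <= 0       # ...then tests the result against zero
        stores += 1              # delay-slot store runs taken or not
        if not taken:
            return stores
```

Starting from -(n - 1), the branch is taken n - 1 times and falls through once, so the store in the delay slot runs exactly n times.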

Note that even though reassociation has helped reduce 
the innermost loop from seven instructions to two 
instructions, one cannot directly extrapolate from this a 
commensurate improvement in the runtime performance 
of this code fragment. In particular, the execution time 
for this example can be dominated by memory subsystem 
overhead (e.g., cache miss penalties) because of poor
data locality associated with data assignments to array A. 

Compiler Implementation 

Register reassociation and other ideas presented in this
article were described in the literature several years
ago.3,4,5,6 Compilers that perform this optimization include
the DN 10000 HP Apollo compilers and the IBM compilers
for the System 370 and RISC System/6000 architectures.7

Strength reduction and linear function test replacement
have been implemented in PA-RISC compilers from their
very inception. The implementation of these optimizations
is closely based on the algorithm described by Allen,
Cocke, and Kennedy. Register reassociation, on the other
hand, has been implemented in the PA-RISC compilers
very recently. The first implementation was added to the
HP 9000 Series 700 compilers in the 8.05 release of the
HP-UX operating system. Register reassociation is enabled
through the use of the +OS compiler option, which is
supported by both the FORTRAN and C compilers in
release 8.05 of the HP-UX operating system.

The implementation of register reassociation offered in
HP-UX 8.05 is somewhat limited in scope. Register
reassociation is performed only on address expressions
found in innermost straight-line loops. The scope of
register reassociation has been greatly extended in the
compilers available with release 8.3 of the HP-UX operating
system, which runs on the Series 700 PA-RISC
workstations. Register reassociation, which is now performed
by default in the C, C++, FORTRAN, and Pascal
compilers at level 2 optimization, is attempted for all
loops (with or without internal control flow) and not
limited merely to straight-line innermost loops. Furthermore,
in close conjunction with register reassociation,
these compilers make aggressive use of the PA-RISC
ADDIB and ADDB instructions and the base-register modification
feature of load and store instructions to eliminate
additional instructions from loops as described earlier.

Using the example given earlier, the following code is
generated by the Series 700 HP-UX 8.3 FORTRAN compiler
at optimization level 2 (without specifying the +OP
FORTRAN preprocessor option).

 1:      FCPY,SGL  %fr0L,%fr4L            ; fr4 <- 0.0
 2:      LDI       800,%r31               ; Pijk_inc <- 800
 3:      LDI       -9,%r23                ; i_cnt <- -9
 4: i_loop_start
 5:      COPY      %r26,%r24              ; Pij1 <- Pi11
 6:      LDI       -19,%r25               ; j_cnt <- -19
 7: j_loop_start
 8:      COPY      %r24,%r29              ; Pijk <- Pij1
 9:      LDI       -29,%r19               ; k_cnt <- -29
10: k_loop_start
11:      ADDIB,<=  1,%r19,k_loop_start    ; k_cnt <- k_cnt + 1;
                                          ; if k_cnt <= 0 go to k_loop_start
12:      FSTWX,M   %fr4L,%r31(0,%r29)     ; *[Pijk] <- 0.0; Pijk <- Pijk + 800
13:      ADDIB,<=  1,%r25,j_loop_start    ; j_cnt <- j_cnt + 1;
                                          ; if j_cnt <= 0 go to j_loop_start
14:      LDO       40(%r24),%r24          ; Pij1 <- Pij1 + 40
15:      ADDIB,<=  1,%r23,i_loop_start    ; i_cnt <- i_cnt + 1;
                                          ; if i_cnt <= 0 go to i_loop_start
16:      LDO       4(%r26),%r26           ; Pi11 <- Pi11 + 4
17:      BV,N      %r0(%r2)               ; return

This code shows that register reassociation and strength
reduction have been applied to all three loop nests.

k-Loop Optimization. For the k-loop, a compiler-generated
temporary pointer variable Pijk, which is maintained in
register 29, is used to track the address of the array
element A(i,j,k). This temporary variable is incremented by
800 through base-register modification (line 12) instead of
the original loop variable k incrementing by one. The 800
comes from DIM1 x DIM2 x element_size (10 x 20 x 4),
which represents the invariant quantity α in expression 2.

Before entering the k-loop, Pijk has to be initialized to the
address of A(i,j,1) because the variable k was originally
initialized to one before entering the k-loop. For this
example, the address of A(i,j,1) can be computed as:

    [((((1 x 20) + j) x 10) + i) - ((1 + 20) x 10 + 1)] x 4 +
        ADDR(A(1,1,1))

which can be simplified to

    40 x j + [(4 x i) - 44 + ADDR(A(1,1,1))]

which for the j-loop is a linear function of the loop
variable j of the form:

    αj + β

where:

    α = 40

and

    β = [(4 x i) - 44 + ADDR(A(1,1,1))].

j-Loop Optimization. To strength-reduce the address expression
for A(i,j,1), the compiler has created a temporary
variable Pij1, which is maintained in register 24. Wherever
the original loop variable j was incremented by one, this
temporary variable is incremented by 40 (line 14). Before
entering the j-loop, Pij1 has to be initialized to the address
of A(i,1,1) since the variable j was originally initialized to
one before entering the j-loop. For this example, the
address of A(i,1,1) can be computed as

    [((((1 x 20) + 1) x 10) + i) - ((1 + 20) x 10 + 1)] x 4 +
        ADDR(A(1,1,1))

which can be simplified to

    4 x i + (ADDR(A(1,1,1)) - 4)

which for the i-loop is a linear function of the loop-varying
quantity i of the form:

    αi + β

where:

    α = 4

and

    β = (ADDR(A(1,1,1)) - 4).



i-Loop Optimization. To strength-reduce the address expression
for A(i,1,1), the compiler has created a temporary
variable Pi11, which is maintained in register 26. Wherever
the original loop variable i was incremented by one, this
temporary variable is incremented by four (line 16).
Before entering the i-loop, Pi11 has to be initialized to the
address of A(1,1,1), since the variable i was originally
initialized to one before entering the i-loop.

The address of A(1,1,1) is passed as a formal parameter to
the subroutine containing our code fragment since by
default, parameters are passed by reference in FORTRAN
(which implies that the first argument to our subroutine
is the address of the very first element of array A). On
PA-RISC machines, the first integer parameter is passed
by software convention in register 26, and since Pi11 is
maintained in register 26, no explicit initialization of the
compiler-generated temporary Pi11 is needed.

Execution                Array Element Addresses Contained in Temporary Variables
Sequence                 Pi11         Pij1          Pijk

i=1,  j=1,  k=1...30     A(1,1,1)     A(1,1,1)      A(1,1,1), A(1,1,2), ... A(1,1,30)
i=1,  j=2,  k=1...30     A(1,1,1)     A(1,2,1)      A(1,2,1) ... A(1,2,30)
  ...
i=1,  j=20, k=1...30     A(1,1,1)     A(1,20,1)     A(1,20,1) ... A(1,20,30)
i=2,  j=1,  k=1...30     A(2,1,1)     A(2,1,1)      A(2,1,1) ... A(2,1,30)
i=2,  j=2,  k=1...30     A(2,1,1)     A(2,2,1)      A(2,2,1) ... A(2,2,30)
  ...
i=2,  j=20, k=1...30     A(2,1,1)     A(2,20,1)     A(2,20,1) ... A(2,20,30)
  ...
i=10, j=1,  k=1...30     A(10,1,1)    A(10,1,1)     A(10,1,1) ... A(10,1,30)
i=10, j=2,  k=1...30     A(10,1,1)    A(10,2,1)     A(10,2,1) ... A(10,2,30)
  ...
i=10, j=20, k=1...30     A(10,1,1)    A(10,20,1)    A(10,20,1) ... A(10,20,30)

Initially Pi11 = Pij1 = Pijk = ADDR(A(1,1,1))
ADDR(A(1,1,2)) = ADDR(A(1,1,1)) + 800 bytes
ADDR(A(1,2,1)) = ADDR(A(1,1,1)) + 40 bytes
ADDR(A(2,1,1)) = ADDR(A(1,1,1)) + 4 bytes

Fig. 3. The array element addresses contained in each of the temporary
variables during different iterations of the loops in the example
code fragment.

The values assumed by all three compiler-generated 
temporary variables during the execution of the example 
code fragment given above are illustrated in Fig. 3. Note
that the elements of array A are initialized in row-major
order. Because FORTRAN arrays are stored in column-major
order, elements of the array are not accessed
contiguously. This could result in cache misses and thus
some performance degradation. For FORTRAN this
problem can be remedied by using the stride-1 inner loop
selection transformation described on page 26. This
transformation examines nested loops to determine if the
loops can be rearranged so that a different loop can run
as the inner loop.
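
The difference between the two traversal orders can be made visible with a rough Python model. This is not a cache simulation and not the preprocessor's algorithm; the 32-byte line size is a hypothetical value chosen for illustration, and the function names are assumptions. The sketch only counts how often two consecutive stores fall in different cache lines.

```python
LINE_BYTES = 32   # hypothetical cache line size, for illustration only


def line_transitions(addresses, line_bytes=LINE_BYTES):
    """Count how often consecutive accesses fall in different cache lines."""
    lines = [a // line_bytes for a in addresses]
    return sum(1 for prev, cur in zip(lines, lines[1:]) if prev != cur)


def order_k_inner():
    """Original nest: k varies fastest, so the stride is 800 bytes."""
    return [(k - 1) * 800 + (j - 1) * 40 + (i - 1) * 4
            for i in range(1, 11) for j in range(1, 21) for k in range(1, 31)]


def order_i_inner():
    """Interchanged nest: i varies fastest, so the stride is 4 bytes."""
    return [(k - 1) * 800 + (j - 1) * 40 + (i - 1) * 4
            for k in range(1, 31) for j in range(1, 21) for i in range(1, 11)]
```

Both orders touch the same 6000 elements; with the i-loop innermost the stores walk through memory contiguously, so in this model far fewer consecutive accesses cross a line boundary.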

Loop Termination. The compiler has managed to eliminate 
all three of the original loop variables by replacing all 
loop termination checks with equivalent ADDIB instructions
(lines 11, 13, and 15). Coupled with the use of base-register
modification, the innermost loop has been reduced
to just two instructions (compared to seven
instructions without reassociation). 
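
To see that the three pointer temporaries and three count registers together visit every element exactly once, the control flow of the listing can be modeled in Python. This is a sketch that follows the annotations (Pi11, Pij1, Pijk and the ADDIB counts), not an instruction-level simulation; the function names and the base parameter are assumptions made for illustration.

```python
def simulate_reassociated_nest(base=0):
    """Follow the three pointer temporaries and three count registers of the
    listing and return the byte addresses stored to, in order."""
    stores = []
    pi11 = base                          # r26 arrives holding ADDR(A(1,1,1))
    i_cnt = -9
    while True:
        pij1 = pi11                      # line 5
        j_cnt = -19                      # line 6
        while True:
            pijk = pij1                  # line 8
            k_cnt = -29                  # line 9
            while True:
                k_cnt += 1               # line 11: ADDIB adds, then tests
                stores.append(pijk)      # line 12: delay-slot store...
                pijk += 800              # ...with base-register modification
                if k_cnt > 0:
                    break
            j_cnt += 1                   # line 13
            pij1 += 40                   # line 14 (delay slot)
            if j_cnt > 0:
                break
        i_cnt += 1                       # line 15
        pi11 += 4                        # line 16 (delay slot)
        if i_cnt > 0:
            return stores


def expected_addresses(base=0):
    """Addresses of A(i,j,k) in the order the source loops visit them."""
    return [base + (k - 1) * 800 + (j - 1) * 40 + (i - 1) * 4
            for i in range(1, 11)
            for j in range(1, 21)
            for k in range(1, 31)]
```

The model performs 6000 stores, one per array element, in the same order as the original DO loops.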

In implementing register reassociation (and exploiting 
architectural features such as base-register modification) 
in a compiler, several factors need to be taken into 
account. These include the number of extra machine 
registers required by the transformed instruction se- 
quence, the number of instructions eliminated from the 
loop, the number of instructions to be executed outside 
the loop (which can be important if the loop iterates only 
a few times), and the impact on instruction scheduling. 
These and other heuristics are used by the HP-UX 8.3 
compilers in determining whether and how to transform 
integer address expressions found in loops. 

Finally, the register reassociation phase of the compiler 
shares information about innermost loops (particularly 
base-register modification patterns) with the software 
pipelining phase. The pipelining phase uses this informa- 
tion to facilitate the overlapped execution of multiple 
iterations of these innermost loops (see "Software 
Pipelining in PA-RISC Compilers" on page 39). 

Conclusion 

Register reassociation is a very effective optimization 
technique and one that makes synergistic use of key 
PA-RISC architectural features. For loop-intensive numeric 
applications whose execution times are not dominated by 
memory subsystem overhead, register reassociation can 
improve run-time performance considerably, particularly 
on hardware implementations with relatively low floating-point 
operation latencies. 

Software Pipelining in PA-RISC 
Compilers 

The performance of programs with loops can be improved by having the 
compiler generate code that overlaps instructions from multiple iterations 
to exploit the available instruction-level parallelism. 

by Sridhar Ramakrishnan 



In the hardware environment, pipelining is the partitioning 
of instructions into simpler computational steps that can 
be executed independently in different functional units 
like adders, multipliers, shifters, and so on. Software 
pipelining is a technique that organizes instructions to 
take advantage of the parallelism provided by indepen- 
dent functional units. This reorganization results in the 
instructions of a loop being simultaneously overlapped 
during execution — that is, new iterations are initiated 
before the previous iteration completes. 

The concept of software pipelining is illustrated in Fig. 1. 
Fig. 1a shows the sequence of instructions that loads a 
variable, adds a constant, and stores the result. We 
assume that the machine supports a degree of parallelism 
so that for multiple iterations of the instructions shown in 
Fig. 1a, the instructions can be pipelined so that a new 
iteration can begin every cycle as shown in Fig. 1b. Fig. 
1b also shows the parts of the diagram used to illustrate 
a software pipeline. The prolog is the code necessary to 
set up the steady-state condition of the loop. In steady 
state one iteration is finishing every cycle. In our example 
three iterations are in progress at the same time. The 
epilog code finishes executing all the operations that 
were started in the steady state but have not yet 
completed. 

Fig. 1. A software pipeline example. (a) The sequence of instructions 
(LOAD, ADD, STORE) in a three-stage pipeline (of one iteration). 
(b) Multiple iterations of the instructions shown in (a) pipelined over 
parallel execution components. 

This example also illustrates the performance improvement 
that can be realized with software pipelining. In the 
example in Fig. 1 the pipelined implementation completes 
three iterations in five cycles. To complete the same 
number of iterations without pipelining would have taken 
nine cycles. 
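
The cycle counts quoted above can be reproduced with a short calculation. This is a minimal sketch under the assumptions of Fig. 1: each stage takes one cycle and a new iteration starts every cycle. The function names are ours.

```python
def sequential_cycles(iterations, stages):
    """Each iteration runs all of its stages before the next one starts."""
    return iterations * stages


def pipelined_cycles(iterations, stages, ii=1):
    """The first iteration needs `stages` cycles; each later one starts ii cycles after its predecessor."""
    return stages + (iterations - 1) * ii


print(pipelined_cycles(3, 3), sequential_cycles(3, 3))
```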

Loop Scheduling 

As instructions proceed through the hardware pipeline in 
a PA-RISC machine, a hardware feature called pipeline 
interlock detects when an instruction needs a result that 
has yet to be produced by some previously executing 
instruction or functional unit. This situation results in a 
pipeline stall. It is the job of the instruction scheduler in 
the compiler to attempt to minimize such pipeline stalls 
by reordering the instructions. The following example 
shows how a pipeline stall can occur. 

For the simple loop 

    for ( I = 0; I < N; I = I + 1 ) {
        A[I] = A[I] + C*B[I]
    }

the compiled code for this loop (using simple pseudo 
instructions) might look like: 

    for ( I = 0; I < N; I = I + 1 ) {
        LOAD  B[I], R2
        LOAD  A[I], R1
        MULT  C, R2, R3     ; R3 = C*R2
                            ; pipeline stalls; R3 needed
        ADD   R1, R3, R4    ; R4 = R1 + R3
                            ; pipeline stalls; R4 needed
        STORE R4, A[I]
    }



Assume that for this hypothetical machine the MULT, ADD, 
STORE, and LOAD instructions take two cycles each. We 
will also assume in this example and throughout this 
paper that no memory access suffers a cache miss. Fig. 2 
illustrates how the pipeline stalls when the ADD and STORE 
instructions must delay execution until the values of R3 
and R4 become available. Clearly, this becomes a serious 
problem if the MULT and ADD instructions have multiple-cycle 
latencies (as in floating-point operations). If we 
ignore the costs associated with the branch instruction in 
the loop, each iteration of the above loop would take 
eight cycles. Put differently, the processor is stalled for 
approximately 25% of the time (two cycles out of every 
eight). 

Fig. 2. An illustration of a pipeline stall. In the diagram, S marks the 
start of an instruction and E marks its end, for the LOAD B[I], 
LOAD A[I], MULT, ADD, and STORE instructions in turn. 
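
The eight-cycle figure can be checked with a small in-order issue model. The model is our own simplification, not the article's: one instruction issues per cycle, every result becomes usable two cycles after issue, and an instruction waits (an interlock) until all of its source registers are ready. The constant C is assumed to be already available.

```python
LATENCY = 2  # cycles for LOAD, MULT, ADD, and STORE on the hypothetical machine

# (opcode, source registers, destination register or None)
LOOP_BODY = [
    ("LOAD", [], "R2"),            # R2 = B[I]
    ("LOAD", [], "R1"),            # R1 = A[I]
    ("MULT", ["R2"], "R3"),        # R3 = C * R2
    ("ADD", ["R1", "R3"], "R4"),   # R4 = R1 + R3
    ("STORE", ["R4"], None),       # A[I] = R4
]


def simulate(body, latency=LATENCY):
    """Return (total cycles, stall cycles) for one pass over `body`."""
    ready = {}   # register -> first cycle at which its value can be used
    cycle = 0    # earliest cycle at which the next instruction may issue
    stalls = 0
    finish = 0
    for _, sources, dest in body:
        issue = max([cycle] + [ready[r] for r in sources])
        stalls += issue - cycle          # cycles lost waiting for operands
        if dest is not None:
            ready[dest] = issue + latency
        finish = max(finish, issue + latency)
        cycle = issue + 1
    return finish, stalls


total, stalled = simulate(LOOP_BODY)
print(total, stalled, stalled / total)
```

Under these assumptions the model reports eight cycles with two stall cycles, which matches the 25% figure in the text.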

One way of avoiding the interlocks in this example is to 
insert useful and independent operations after each of the 
MULT and ADD instructions. For the example above there 
are none. This problem can be solved by unrolling the 
loop a certain number of times as follows: 

    for ( I = 0; I < N; I = I + 2 ) {
        A[I]   = A[I]   + C*B[I]
        A[I+1] = A[I+1] + C*B[I+1]
    }

Notice that the loop increment is changed from one to 
two to take into account the fact that each time the loop 
is entered we now perform two iterations of the original 
loop. There is additional compensation code* that is not 
shown here for the sake of simplicity. 
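
For readers who want to see what the omitted code might look like, here is a minimal sketch of the unrolled loop with one possible form of cleanup for an odd trip count. The function name and the cleanup strategy are our assumptions; the article does not show how the compilers handle the leftover iteration.

```python
def update_unrolled(a, b, c):
    """Compute A[I] = A[I] + C*B[I] for all I, two original iterations per pass."""
    n = len(a)
    i = 0
    while i + 1 < n:                      # unrolled body
        a[i] = a[i] + c * b[i]
        a[i + 1] = a[i + 1] + c * b[i + 1]
        i += 2
    if i < n:                             # leftover iteration when n is odd
        a[i] = a[i] + c * b[i]
    return a


print(update_unrolled([1, 2, 3], [1, 1, 1], 2))
```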

The best schedule for this loop is as follows: 

    for ( I = 0; I < N; I = I + 2 ) {
        LOAD  B[I], R2
        LOAD  A[I], R1
        MULT  C, R2, R3
        LOAD  B[I+1], R6
        ADD   R1, R3, R4
        LOAD  A[I+1], R5
        MULT  C, R6, R7
        STORE R4, A[I]
        ADD   R5, R7, R8
                            ; stall; R8 needed
        STORE R8, A[I+1]
    }

* In the context of software pipelining, compensation code refers to the code that sets up the 
steady state and the code that executes after completion of the steady state. Compensation 
code is discussed later in this article. 

If we assume perfect memory access on the LOADs and 
STOREs, this schedule will execute two iterations every 11 
cycles (again, ignoring the costs associated with the 
branch instruction). Fig. 3 shows what happens during 
each cycle for one iteration of the unrolled loop. 

Despite the improvement made with loop unrolling, there 
are three problems with this technique. First, and perhaps 
most important, is that the schedule derived for a single 
iteration of an unrolled loop does not take into account 
the sequence of instructions that appears at the beginning 
of the next iteration. Second, we have a code size expansion 
that is proportional to the size of the original loop 
and unroll factor. Third, the unroll factor is determined 
arbitrarily. In our example, the choice of two for the 
unroll factor was fortuitous since the resulting schedule 
eliminated one stall. A larger unroll factor would have 
generated more code than was necessary. 

Software pipelining attempts to remedy some of the 
drawbacks of loop unrolling. The following code is a 
software pipelined version of the above example (again, 
we do not show the compensation code): 

    for ( I = 0; I < N; I = I + 1 ) {
        LOAD  B[I+3], R14   ; start the fourth iteration
        LOAD  A[I+3], R13   ; start the fourth iteration
        MULT  C, R10, R11   ; R11 = C * B[I+2] (third iteration)
        ADD   R5, R7, R8    ; A[I+1] = A[I+1] + C * B[I+1] (second
                            ; iteration)
        STORE R4, A[I]      ; finish the first iteration
    }



Fig. 4 shows the pipeline diagram for this example. 
Notice that in a single iteration of the pipelined loop, we 
are performing operations from four different iterations. 



40 June 1992 Hewlett-Packard Journal 



)Copr. 1949-1998 Hewlett-Packard Co. 















[Fig. 3 diagram: cycle-by-cycle execution (cycles 1 through 11) of the instructions in the unrolled loop, with a stall before the final STORE.] 



























However, each successive iteration of the pipelined loop 
would destroy the values stored in the registers that are 
needed three iterations later. On some machines, such as 
the Cydra 5, this problem is solved by hardware register 
renaming (reference 1). In the PA-RISC compilers, this problem is 
solved by unrolling the steady-state code u times, where 
u is the number of iterations simultaneously in flight. In 
Fig. 4, u is four, since there are four iterations executing 
simultaneously during cycles 6 and 7. 

Software pipelining is not without cost. First, like loop 
unrolling, it has the code size expansion problem. Second, 
there is an increased use of registers because each new 
iteration uses new registers to store results. If this 
increased register use cannot be met by the available 
supply of registers, the compiler is forced to generate 
"spill" code, in which results are loaded and stored into 
memory. The compiler tries to ensure that this does not 
happen. Third, the PA-RISC compilers will not handle 
loops that have control flow in them (for example, a loop 
containing an if-then statement). 

Fig. 3. One iteration of the example in Fig. 2 after the code is unrolled to 
execute two iterations of the previous loop in one iteration. 

ii = Initiation Interval (The Number of Cycles Before a New Iteration 
Can Be Started) 

Fig. 4. A software pipeline diagram of the example code fragment. 

However, unlike unrolling, in which the unrolling factor is 
arbitrarily determined, the factor by which the steady-state 
code is unrolled in software pipelining is determined 
algorithmically (this will be explained in detail later). A 
key advantage of software pipelining is that the instruction 
pipeline filling and draining process occurs once 
outside the loop during the prolog and epilog sections of 
the code, respectively. During this period, the loop does 
not run with the maximum degree of overlap among the 
iterations. 

Pipeline Scheduling 

To pipeline a loop consisting of N instructions, the 
following questions must be answered: 
• What is the order of the N instructions? 
• How frequently (in cycles) should the new iteration be 
initiated? (This quantity is called the initiation interval, 
or ii.) 

Conventional scheduling techniques address just the first 
question. 

The goal of pipelining is to arrive at a minimum value of 
ii because we would like to initiate iterations as frequently 
as possible. This section will provide a brief discussion 
about how the value of ii is determined in the PA-RISC 
compilers. More information on this subject is provided in 
reference 2. 

The scheduling process is governed by two kinds of 
constraints: resource constraints and precedence 
constraints. Resource constraints stem from the fact that 
a given machine has a finite number of resources and the 
resource requirements of a single iteration should not 
exceed the available resources. If an instruction is scheduled 
at cycle x, we know that the same instruction will 
also execute at cycle x + ii, cycle x + (2 x ii), and so on 
because iterations are initiated every ii cycles. For the 
example shown in Fig. 4, ii is two. 

Fig. 5. An example of a reservation table for the 
FMPY instruction. 

In PA-RISC compilers we build a resource reservation 
table associated with each instruction. For example, the 
instruction: 

    FMPY,DBL fr1, fr2, fr3    ; fr3 = fr1 * fr2 

would have the resource reservation table shown in Fig. 5 
for the HP 9000 Series 700 family of processors. The 
reservation table defines the resources used by an 
instruction for each cycle of execution. For example, the 
FMPY instruction modeled in Fig. 5 requires the floating-point 
multiplier and the target register (fr3) during its 
second cycle of execution. The length of the table is 
dependent on the latency of the associated instruction. 

Precedence constraints are constraints that arise because 
of dependences in the program. For example, for the 
instructions: 

    FMPY fr1, fr2, fr3 
    FADD fr3, fr4, fr2 

there is a dependence from the FMPY instruction to the 
FADD instruction. Also, there is a dependence that goes 
from the FADD to the FMPY instruction because the FMPY 
from the next iteration cannot start until the FADD from 
the preceding iteration completes. Such dependencies can 
be represented as a graph in which the nodes represent 
machine instructions and the edges represent the direction 
of dependence (see Fig. 6). The attributes on the 
edges represent: 
• d: a delay value (in cycles) from node u to node v. This 
value implies that to avoid a stall node v can start no 
earlier than d cycles after node u starts executing. 
• p: a value that represents the number of iterations before 
the dependence surfaces (i.e., minimum iteration distance).* 
This is necessary because we are overlapping 
multiple iterations. A dependence that exists in the same 
iteration will have p = 0 (FADD depends on fr3 in Fig. 6). 
Values of p are never negative because a node cannot 
depend on a value from a future iteration. Edges that 
have p = 0 are said to represent intra-iteration dependences, 
while nonzero p values represent inter-iteration 
dependences. 

Given an initiation interval, ii, and an edge with values 
<p, d> between two nodes u and v, if the function S(x) 
gives the cycle at which node x is scheduled with respect 
to the start of each iteration, we can write: 

$$S(v) - S(u) \ge d(u,v) - ii \times p(u,v) \qquad (1)$$

If p(u,v) = 0 then: 

$$S(v) - S(u) \ge d(u,v).$$

Equation 1 is depicted in Fig. 7. 
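
As an illustration of equation 1, the following sketch checks a candidate schedule against a set of dependence edges. The edge delays of three cycles, the schedule, and all names are hypothetical values chosen for the FMPY/FADD pair discussed above. They are not taken from Fig. 6.

```python
from collections import namedtuple

# Dependence from node u to node v with iteration distance p and delay d.
Edge = namedtuple("Edge", "u v p d")


def satisfies_precedence(schedule, edges, ii):
    """True if S(v) - S(u) >= d(u,v) - ii * p(u,v) holds on every edge."""
    return all(schedule[e.v] - schedule[e.u] >= e.d - ii * e.p for e in edges)


edges = [
    Edge("FMPY", "FADD", p=0, d=3),  # intra-iteration: FADD reads fr3
    Edge("FADD", "FMPY", p=1, d=3),  # inter-iteration: the next FMPY reads fr2
]
schedule = {"FMPY": 0, "FADD": 3}    # S(x) relative to the start of an iteration

print(satisfies_precedence(schedule, edges, ii=6))
print(satisfies_precedence(schedule, edges, ii=5))
```

With these assumed delays the schedule is legal for ii = 6 but not for ii = 5, because the inter-iteration edge would then require the next FMPY to start before the FADD result is available.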

The goal of scheduling the N instructions in the loop is to 
arrive at the schedule function S and a value for ii. This 
is done in the following steps: 

1. Build a graph representing the precedence constraints 
between the instructions in the loop and construct the 
resource reservation table associated with each of the 
nodes of the graph. 

2. Determine the minimum value of the initiation interval 
(MII) based on the resource requirements of the N 
instructions in the loop. For example, if the floating-point 
multiply unit is used for 10 cycles in an iteration, and 
there is only one such functional unit, MII can be no 
smaller than 10 cycles. 

3. Determine the recurrence minimum initiation interval, 
RMII. This value takes into account the cycles associated 


[Fig. 6 diagram: two nodes joined by an intra-iteration edge and an inter-iteration edge, each labeled with its <p, d> values.] 

Fig. 6. A dependency graph. 



* The p values are sometimes called omega values in the literature on software pipelining. 
RMII is the maximum value of ii for all cycles c in a 
graph. 

4. Determine the minimum initiation interval, min_ii, which 
is given by: min_ii = max(MII, RMII). 

5. Determine the maximum initiation interval, max_ii. This 
is obtained by scheduling the loop without worrying 
about the inter-iteration dependences. The length of such 
a schedule gives max_ii. If we go back to our first 
example, we can see that we had a schedule length of 
eight cycles. There is no advantage if we initiate new 
iterations eight or more cycles apart. Therefore, eight 
would be a proper upper bound for the initiation interval 
in that example. 

* ⌈x⌉ represents the smallest integer that is greater than or equal to x. For example, ⌈5/3⌉ = 2. 

6. Determine the value of ii by iterating in the range 
[min_ii, max_ii]. 

6.1 For each value, ii, in the range min_ii to max_ii do 
the following: 

6.2 Pick an unscheduled node x from the dependency 
graph and attempt to schedule it by honoring its precedence 
and resource constraints. If it cannot be scheduled 
within the current ii, increment ii and start over. 

If the unscheduled node can be scheduled at cycle m, set 
S(x) = m and update a global resource table with the 
resources consumed by node x starting at cycle m. This 
global resource table is maintained as a modulo ii 
table (that is, the table wraps around). 

6.3 If all the nodes have been scheduled, a schedule 
assignment S(x) and a value for ii have been successfully 
found and the algorithm is terminated. 

Otherwise, go back to step 6.2. 
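
The lower bound computed in steps 2 through 4 can be sketched as follows. The resource bound follows the floating-point multiplier example in step 2. For the recurrence bound we assume the formulation commonly used in the software pipelining literature, in which each dependence cycle c requires ii to be at least the ceiling of its total delay divided by its total iteration distance. The function names and the sample data are ours.

```python
import math


def resource_mii(usage, units):
    """usage[r]: cycles one iteration occupies resource r; units[r]: copies of r."""
    return max(math.ceil(usage[r] / units[r]) for r in usage)


def recurrence_mii(cycles):
    """cycles: dependence cycles, each a list of (p, d) edge labels.

    Assumes every cycle has a total iteration distance of at least one.
    """
    bound = 0
    for cycle in cycles:
        total_d = sum(d for _, d in cycle)
        total_p = sum(p for p, _ in cycle)
        bound = max(bound, math.ceil(total_d / total_p))
    return bound


mii = resource_mii({"fp_multiplier": 10}, {"fp_multiplier": 1})
rmii = recurrence_mii([[(0, 3), (1, 3)]])   # hypothetical FMPY/FADD cycle
min_ii = max(mii, rmii)
print(mii, rmii, min_ii)
```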
Fig. 9. Loop transformation to add compensation code to make 
sure the pipelined loop executes the same number of times as the 
original loop. Block B represents the original loop and the case in 
which the loop does not execute enough times to reach the pipeline. 

There are three important points about this algorithm that 
need to be made: 

• The length of a pipeline schedule (LP) may well exceed 
max_ii. 

• Since we iterate up to max_ii, we are guaranteed that the 
schedule for the steady state will be no worse than a conventional 
schedule for the loop. 

• Step 6.2 is difficult because it involves choosing the best 
node to schedule given a set of nodes that may be available 
and ready to be scheduled. The choice of a priority 
function that balances considerations such as the implication 
for register pressure if this node were to be 
scheduled at this cycle, the relationship of this node to 
the critical path, and the critical resources used by this 
node, is key and central to any successful scheduling 
strategy. 
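
Step 6 can be sketched in a few lines. This is a deliberately simplified stand-in for the algorithm described above, not the compilers' implementation: nodes are taken in a fixed caller-supplied order instead of being chosen by a priority function, there is no backtracking, and each node names the resources it occupies as (resource, cycle offset) pairs. All names and sample data are hypothetical.

```python
def try_schedule(nodes, edges, uses, ii):
    """Try to place every node for a given ii; return S or None on failure.

    edges: (u, v, p, d) tuples; uses[x]: list of (resource, cycle offset).
    """
    S = {}
    table = [set() for _ in range(ii)]        # modulo ii reservation table
    for x in nodes:
        earliest = 0
        for u, v, p, d in edges:              # already placed predecessors
            if v == x and u in S:
                earliest = max(earliest, S[u] + d - ii * p)
        for m in range(earliest, earliest + ii):   # ii slots cover every row
            deps_ok = all(S[v] - m >= d - ii * p
                          for u, v, p, d in edges if u == x and v in S)
            rows = [(m + off) % ii for _, off in uses[x]]
            free = all(res not in table[row]
                       for (res, _), row in zip(uses[x], rows))
            if deps_ok and free:
                S[x] = m
                for (res, _), row in zip(uses[x], rows):
                    table[row].add(res)       # table wraps around modulo ii
                break
        else:
            return None                       # x does not fit within this ii
    return S


def find_ii(nodes, edges, uses, min_ii, max_ii):
    """Increase ii from min_ii until a schedule is found."""
    for ii in range(min_ii, max_ii + 1):
        S = try_schedule(nodes, edges, uses, ii)
        if S is not None:
            return ii, S
    return None


fp_edges = [("FMPY", "FADD", 0, 3), ("FADD", "FMPY", 1, 3)]
fp_uses = {"FMPY": [("multiplier", 0)], "FADD": [("adder", 0)]}
print(find_ii(["FMPY", "FADD"], fp_edges, fp_uses, 5, 8))

mem_uses = {"ld1": [("memory_port", 0)], "ld2": [("memory_port", 0)]}
print(find_ii(["ld1", "ld2"], [], mem_uses, 1, 4))
```

In the first example the recurrence forces ii up to six. In the second, two loads competing for a single assumed memory port cannot share a row of the modulo table, so ii = 1 fails and ii = 2 succeeds.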

Given LP and ii, we can now determine the stage count 
(sc), or number of stages into which the computation in a 
given iteration is partitioned (see Fig. 8). This is given by: 

$$sc = \lceil LP / ii \rceil \qquad (2)$$



The stage count also gives us the number of times we 
need to unroll the steady-state code. This is the unroll 
factor mentioned earlier. 

There are two observations about equation 2. First, to 
guarantee execution of the prolog, steady-state, and epilog 
portions of the code, there must be at least (2 x sc) - 1 
iterations. Second, once the steady-state code is entered, 
there must be at least sc iterations left. In the pipeline 
diagram shown in Fig. 8 there must be at least five 
iterations to get through the pipeline (to get from a to b), 
and there must be three iterations left once the steady- 
state portion of the pipeline is reached. 
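
These relationships are easy to check numerically. The sketch below assumes that the stage count is the schedule length divided by the initiation interval, rounded up. The values of LP and ii are hypothetical and were chosen to give the stage count of three used in the Fig. 8 discussion.

```python
import math


def stage_count(lp, ii):
    """Number of ii-cycle stages needed to cover a schedule of length lp."""
    return math.ceil(lp / ii)


def min_iterations_for_pipeline(sc):
    """Fewest iterations that exercise the prolog, steady state, and epilog."""
    return 2 * sc - 1


sc = stage_count(lp=6, ii=2)   # hypothetical schedule length and interval
print(sc, min_iterations_for_pipeline(sc))
```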

Several different ways are available for generating the 
compensation code necessary to ensure the above condi- 
tions. For example, consider the following simple loop. 



    for ( i = 1; i <= T; i++ ) {
        B;
    }

Here B represents some computation that does not 
involve conditional statements. This loop is transformed 
into: 

    T = number of times loop executes

    if ( T < 2 * sc - 1 ) then      /* Are there enough iterations
                                     * to enter the pipeline?
                                     * No. */
        M = T - sc;
        goto jump_out;
    end if;

    M = T - (sc - 1)                /* sc-1 iterations are completed
                                     * in the prolog and epilog */
    prolog;
    steady_state;                   /* Each iteration of the steady state
                                     * decrements M by sc and the steady
                                     * state terminates when M < 0 */
    epilog;

    jump_out:
    M = M + sc
    for ( i = 1; i <= M; i++ ) {    /* Compensation code executes
        B;                           * M times. If M = 0, this
    }                                * loop does not execute */



This transformation is shown in Fig. 9. The compensation 
code in this code segment ensures that all the loop 
iterations specified in the original code are executed. For 
example, if T = 100 and sc = 3, the number of compensation 
(or cleanup) iterations is 2. As illustrated in Fig. 10, 
the prolog and epilog portions take care of 2 iterations, 
the compensation code takes care of 2 iterations, and the 
steady-state portion handles the remaining 96 iterations 
(which means that the unrolled steady state executes 32 
times). 
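
The division of iterations in this example can be reproduced with a short function. It uses the quantities given in the text and in Fig. 10: sc - 1 iterations are completed by the prolog and epilog, and the number of compensation iterations is (T + 1 - sc) mod sc when the pipeline executes. The function name and the returned dictionary layout are ours.

```python
def partition_iterations(T, sc):
    """Split T loop iterations among the portions of a pipelined loop."""
    if T < 2 * sc - 1:
        # Too few iterations to enter the pipeline: the original loop runs them all.
        return {"prolog_epilog": 0, "steady_state_passes": 0, "compensation": T}
    prolog_epilog = sc - 1
    compensation = (T + 1 - sc) % sc
    passes = (T - prolog_epilog - compensation) // sc   # passes of the unrolled steady state
    return {"prolog_epilog": prolog_epilog,
            "steady_state_passes": passes,
            "compensation": compensation}


print(partition_iterations(100, 3))
```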

A Compiled Example 

Software pipelining is supported on the HP 9000 Series 
700 and 800 systems via the -O option. Currently, loops 
that are small (have fewer than 100 instructions) and 
have no control flow (no branch or call statements) are 
considered for pipelining. 

The following example was compiled by the PA-RISC 
FORTRAN compiler running on the HP-UX 8.0 operating 
system: 

          do 10 i = 1, 1000 
             z(i) = (x(i) * y(i) + a) * b 
       10 continue 

where x, y, z, a, and b are all double-precision quantities. 
The compensation code is not shown in the following 
example. 

Without pipelining, the PA-RISC code generated is: 

    $00000015
        FLDDX,S    %29(0,%23),%fr4      ; fr4 = x(i)
        FLDDX,S    %29(0,%26),%fr5      ; fr5 = y(i)
                                        ; ## 1 cycle interlock
        FMPY,DBL   %fr4,%fr5,%fr6       ; fr6 = x(i)*y(i)
                                        ; ## 2 cycle interlock
        FADD,DBL   %fr6,%fr22,%fr7      ; fr7 = x(i)*y(i)+a
                                        ; ## 2 cycle interlock
        FMPY,DBL   %fr7,%fr23,%fr8      ; fr8 = (x(i)*y(i)+a)*b
                                        ; ## 1 cycle interlock
        FSTDX,S    %fr8,%29(0,%25)      ; store z(i)
        LDO        1(%29),%29           ; increment i
        COMB,<=,N  %29,%6,$00000015+4
        FLDDX,S    %29(0,%23),%fr4      ; fr4 = x(i+1)

[Fig. 10 diagram: the prolog, the steady state (sc = 3), and the epilog of the pipelined loop.] 

T = 100 iterations 
Prolog + Epilog = 2 iterations 
Stage count = 3 
Compensation iterations C = (T + 1 - sc) mod sc = 2 (only if pipeline executes) 
Iterations in steady state = (100 - (Prolog + Epilog + C))/sc = (100 - 4)/3 = 32 

Fig. 10. An illustration of how loop iterations are divided among 
the portions of a pipelined loop. 



If we assume perfect memory accesses, each iteration 
takes 14 cycles. Since there are six cycles of interlock, 
the CPU is stalled 43% of the time. 

Using the pipelining techniques of loop unrolling and 
instruction scheduling to avoid pipeline stalls, the PA- 
RISC code generated is: 



    L$9000
        COPY       %r26,%r20            ; r20 = i + 2
        LDO        1(%r26),%r26         ; r26 = i + 3
        FLDDX,S    %r20(0,%r23),%fr11   ; fr11 = x(i+2)
        FLDDX,S    %r20(0,%r25),%fr12   ; fr12 = y(i+2)
        FADD,DBL   %fr5,%fr22,%fr13     ; fr13 = x(i+1)*y(i+1)+a
        FMPY,DBL   %fr11,%fr12,%fr7     ; fr7 = x(i+2)*y(i+2)
        FSTDX,S    %fr4,%r29(0,%r24)    ; store z(i) result
        FMPY,DBL   %fr13,%fr23,%fr8     ; fr8 = (x(i+1)*y(i+1)+a)*b
        COPY       %r26,%r29            ; r29 = i + 3
        LDO        1(%r26),%r26         ; r26 = i + 4
        FLDDX,S    %r29(0,%r23),%fr9    ; fr9 = x(i+3)
        FLDDX,S    %r29(0,%r25),%fr10   ; fr10 = y(i+3)
        FADD,DBL   %fr7,%fr22,%fr14     ; fr14 = x(i+2)*y(i+2)+a
        FMPY,DBL   %fr9,%fr10,%fr6      ; fr6 = x(i+3)*y(i+3)
        FSTDX,S    %fr8,%r19(0,%r24)    ; store z(i+1) result
        FMPY,DBL   %fr14,%fr23,%fr9     ; fr9 = (x(i+2)*y(i+2)+a)*b
        COPY       %r26,%r19            ; r19 = i + 4
        LDO        1(%r26),%r26         ; r26 = i + 5
        FLDDX,S    %r19(0,%r23),%fr8    ; fr8 = x(i+4)
        FLDDX,S    %r19(0,%r25),%fr11   ; fr11 = y(i+4)
        FADD,DBL   %fr6,%fr22,%fr7      ; fr7 = x(i+3)*y(i+3)+a
        FMPY,DBL   %fr8,%fr11,%fr5      ; fr5 = x(i+4)*y(i+4)
        FSTDX,S    %fr9,%r20(0,%r24)    ; store z(i+2) result
        FMPY,DBL   %fr7,%fr23,%fr4      ; fr4 = (x(i+3)*y(i+3)+a)*b
        COMB,<=,N  %r26,%r4,L$9000+4    ; are we done?
        COPY       %r26,%r20



This loop produces three results every 26 cycles, which 
means that an iteration completes every 8.67 cycles. Since 
there are no interlock cycles we have 100% CPU utilization 
in this loop. Since it takes 14 cycles per iteration 
without pipelining, there is a speedup of approximately 
38% in cycles per iteration with pipelining. 
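
The percentages quoted for this example follow directly from the cycle counts given in the text, as the following arithmetic shows. The variable names are ours.

```python
unpipelined_cycles = 14        # cycles per iteration without pipelining
interlock_cycles = 6           # interlock cycles per unpipelined iteration
cycles_per_pass = 26           # cycles per pass of the pipelined steady state
results_per_pass = 3           # iterations completed per pass

stall_fraction = interlock_cycles / unpipelined_cycles
cycles_per_iteration = cycles_per_pass / results_per_pass
reduction = 1 - cycles_per_iteration / unpipelined_cycles

print(round(stall_fraction, 2), round(cycles_per_iteration, 2), round(reduction, 2))
```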

Another optimization technique provided in PA-RISC 
compilers, called register reassociation, can be used with 
software pipelining to generate better code because 
during steady state it uses different base registers for 
each successive iteration. See the article on page 33 for 
more on register reassociation. 

Acknowledgments 

The author would like to thank Manoj Dadoo, Monica 
Lam, Meng Lee, Bruce Olsen, Bob Rau, Mike Schlansker, 
and Carol Thompson for their assistance with this project. 

References 

1. B.R. Rau, et al, "The Cydra 5 Departmental Supercomputer," Computer, 
January 1989, pp. 12-35. 

2. M. Lam, "Software Pipelining: An Effective Scheduling Technique 
for VLIW Machines," Proceedings of the SIGPLAN '88 Conference 
on Programming Language Design and Implementation, June 1988. 






Shared Libraries for HP-UX 

Transparency is the main contribution of the PA-RISC shared library 
implementation. Most users can begin using shared libraries without 
making any significant changes to their existing applications. 

by Gary A. Coutant and Michelle A. Ruscetta 



Multiprogramming operating systems have long had the 
ability to share a single copy of a program's code among 
several processes. This is made possible by the use of 
pure code, that is, code that does not modify itself. The 
compilers partition the program into a code segment that 
can be protected against modification and a data segment 
that is private to each process. The operating system can 
then allocate a new data segment to each process and 
share one copy of the code segment among them all. 

This form of code sharing is useful when many users are 
each running the same program, but it is more common 
for many different programs to be in use at any one time. 
In this case, no code sharing is possible using this simple 
scheme. Even two vastly different programs, however, are 
likely to contain a significant amount of common code. 
Consider two FORTRAN programs, each of which may 
contain a substantial amount of code from the FORTRAN 
run-time library, code that could be shared under the 
right circumstances. 

Fig. 1. Library implementations. (a) Archive library in which each 
program has its own copy of the library code. (b) A shared library 
implementation in which one copy of the library is shared between 
programs. 

A shared library is a collection of subroutines that can be 
shared among many programs. Instead of containing 
private copies of the library routines it uses, a program 
refers to the shared library. With shared libraries, each 
program file is reduced in size by the amount of library 
code that it uses, and virtual memory use is decreased to 
one copy of each shared library's code, rather than many 
copies bound into every program file. 

Fig. 1a shows a library scheme in which each program 
contains a private copy of the library code (libc). This 
type of library implementation is called an archive library. 
Note that the processes vi1 and vi2 share the same copy 
of the text segment, but each has its own data segment. 
The same is true for ls1 and ls2. Fig. 1b shows a shared 
library scheme in which one copy of the library is shared 
among several programs. As in Fig. 1a, the processes 
share one copy of their respective text segments, except 
that now the library portion is not part of the program's 
text segment. 

Shared libraries in the HP-UX* operating system were 
introduced with the HP-UX 8.0 release, which runs on the 
HP 9000 Series 300, 400, 700, and 800 workstations and 
systems. This feature significantly reduces disk space 
consumption, and allows the operating system to make 
better use of memory. The motivation and the design for 
shared libraries on the Series 700 and 800 PA-RISC 
workstations and systems are discussed in this article. 

How Shared Libraries Work 

Traditional libraries, now distinguished as relocatable or 
archive libraries, contain relocatable code, meaning that 
the linker can copy library routines into the program, 
symbolically resolve external references, and relocate the 
code to its final address in the program. Thus, in the final 
program, references from the program to library routines 
and data are statically bound by the linker (Fig. 2a). 

A shared library, on the other hand, is bound to a program 
at run time (Fig. 2b). Not only must the binding 
preserve the purity of the library's code segment, but 
because the binding is done at run time, it must also be 
fast. 

With these constraints in mind, we consider the following 
questions: 

1. How does the program call a shared library routine? 

2. How does the program access data in the shared 
library? 

3. How does a shared library routine call another routine 
in the same library? 

4. How does a shared library routine access data in the 
same library? 

5. How does a shared library routine call a routine in 
another shared library (or in the program)? 

6. How does a shared library routine access data in 
another shared library (or in the program)? 

Fig. 2. Binding libraries to programs. (a) In relocatable or archive 
libraries, the linker binds the program's .o files and the referenced 
library files to create the executable a.out file. When a.out is run the 
loader creates the core image and runs the program. (b) For 
shared libraries, the linker creates an incomplete executable file 
(the library routines are not bound into the a.out file at link time). 
The shared library routines are dynamically loaded into the program's 
address space at run time. 

Linkage Tables. These questions can be answered several 
ways. The simplest technique is to bind each shared 
library to a unique address and to use the bound addresses 
of the library routines in each program that references 
the shared library. This achieves the speed of static 
binding associated with archive libraries, but it has three 
significant disadvantages: it is inflexible and difficult to 
maintain from release to release, it requires a central 
registry so that no two shared libraries (including third-party 
libraries) are assigned the same address, and it 
assumes infinite addressing space for each process. 

Instead, we use entities called linkage tables to gather all 
addresses that need to be modified for each process. 
Collecting these addresses in a single table not only 
keeps the code segment pure, but also lessens the cost of 
the dynamic binding by minimizing the number of places 
that must be modified at run time. 

All procedure calls into and between shared libraries 
(questions 1 and 5) are implemented indirectly via a 
procedure linkage table (PLT). In addition, procedure 
calls within a shared library (question 3) are done this 
way to allow for preemption (described later). The 
program and each shared library contain a procedure 
linkage table in their data segments. The procedure 
linkage table contains an entry for each procedure called 
by that module (Fig. 3). 
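
The idea of calling through a procedure linkage table can be modeled in a few lines. This is a conceptual sketch only: the class, the `bind` function, and the symbol names are hypothetical, and a Python dictionary stands in for the table of addresses that the real dynamic loader fills in. It does not model preemption or the actual PA-RISC calling sequence.

```python
class Module:
    """A program or shared library with its own procedure linkage table."""

    def __init__(self, name, exports=None, imports=()):
        self.name = name
        self.exports = exports or {}                  # symbol -> routine
        self.plt = {sym: None for sym in imports}     # one entry per called procedure

    def call(self, symbol, *args):
        """Indirect call: look up the target in the PLT, then invoke it."""
        return self.plt[symbol](*args)


def bind(modules):
    """Stand-in for the dynamic loader: fill each PLT entry at run time."""
    for module in modules:
        for sym in module.plt:
            for provider in modules:
                if sym in provider.exports:
                    module.plt[sym] = provider.exports[sym]
                    break


library = Module("libexample", exports={"square": lambda x: x * x})
program = Module("a.out", imports=["square"])
bind([program, library])
print(program.call("square", 4))
```

Because only the table entries are written at bind time, the code that performs the call never has to be modified, which is the property that keeps the code segment pure.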

Similarly, a shared library accesses its data and other 
libraries' data (questions 4 and 6) through a data linkage 
table (DLT). This indirection requires the compilers to 
generate indirect loads and stores when generating code 
for a shared library, which means that shared library 
routines must be compiled with the appropriate compiler 
option. 

Fig. 3. Linkage tables provide the link between a program or procedure 
and the shared library routines. The procedure linkage 
table (PLT) contains pointers to routines referenced in a program, 
a procedure, or a shared library routine. The data linkage 
table (DLT) contains pointers that provide a shared library with 
access to its own data as well as other libraries' data. 

Indirect access to data is costly because it involves an 
extra memory reference for each load and store. We did 
not want to force all programs to be compiled with 
indirect addressing for all data, nor did we want the 
compilers attempting to predict whether a given data 
reference might be resolved within the program itself, or 
within a shared library. 

To deal with these data access issues we chose to satisfy 
all data references from the program to a shared library 
(question 2) by importing the data definitions from the 
shared libraries statically (that is, at link time). Thus, 
some or all of a shared library's data may be allocated in 
the program's data segment, and the shared library's DLT 
will contain the appropriate address for each data item. 

Binding Times. To bind a program with the shared libraries 
it uses, the program invokes a dynamic loader before it 
does anything else. The dynamic loader must do three 
things: 

1. Load the code and data segments from the shared 
libraries into memory. 

2. Resolve all symbolic references and initialize the 
linkage tables. 

3. Modify any absolute addresses contained within any 
shared library data segments. 

Step 1 is accomplished by mapping the shared library file 
into memory. Step 2 requires the dynamic loader to 
examine the linkage tables for each module (program and 
shared libraries), find a definition for each unsatisfied 
reference, and set the entries for both the data and 
procedure linkage tables to the appropriate addresses. 
Step 3 is necessary because a shared library's data 
segment may contain a pointer variable that is supposed 
to be initialized to the address of a procedure or variable. 
Because these addresses are not known until the library 
is loaded, they must be modified at this point. (Modification 
of the code segment would make it impure, so the 
code segment must be kept free of such constructs.) 

Step 2 is likely to be the most time-consuming, since it 
involves many symbol table lookups. To minimize the 
startup time associated with programs that use shared 
libraries, we provide a mechanism called deferred bind- 
ing. This allows the dynamic loader to initialize every 
procedure linkage table entry with the address of an 
entry point within the dynamic loader. When a shared 
library procedure is first called, the dynamic loader will 
be invoked instead, at which time it will resolve the 
reference, provide the actual address in the linkage table 
entry, and proceed with the call. This allows the cost of 
binding to be spread out more evenly over the total 
execution time of the program, so it is less noticeable. An 
immediate binding mode is also available as an option. 
Deferred and immediate binding are described in more 
detail later in this article. 
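
The deferred binding idea can be illustrated with a small C model. This is only a sketch under stated assumptions: the names (plt_entry, resolve, call_via_plt) are hypothetical, the export table is a stand-in for a real symbol table, and an unbound entry is marked with a null pointer rather than with the address of a loader entry point as in the actual design. 

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of deferred binding; all names here are hypothetical. */
typedef int (*proc_t)(int);

static int lib_double(int x) { return 2 * x; }
static int lib_square(int x) { return x * x; }

/* Stand-in for a shared library's export table. */
struct export { const char *name; proc_t addr; };
static const struct export exports[] = {
    { "lib_double", lib_double },
    { "lib_square", lib_square },
};

/* One procedure linkage table slot. A NULL target means "not bound yet"
   (the real design stores a loader entry point there instead). */
struct plt_entry { const char *name; proc_t target; };
static struct plt_entry plt[] = {
    { "lib_double", NULL },
    { "lib_square", NULL },
};

static int lookups = 0;   /* number of symbol table searches performed */

/* Stand-in for the dynamic loader: resolve a symbol by name. */
static proc_t resolve(const char *name)
{
    size_t i;
    lookups++;
    for (i = 0; i < sizeof exports / sizeof exports[0]; i++)
        if (strcmp(exports[i].name, name) == 0)
            return exports[i].addr;
    return NULL;
}

/* Call through a PLT slot, binding it on first use only. */
static int call_via_plt(size_t slot, int arg)
{
    assert(slot < sizeof plt / sizeof plt[0]);
    if (plt[slot].target == NULL) {
        plt[slot].target = resolve(plt[slot].name);
        assert(plt[slot].target != NULL);   /* unresolved symbol */
    }
    return plt[slot].target(arg);
}
```

In this model the lookup cost is paid once per procedure, on its first call, and procedures that are never called are never looked up. 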

Position Independent Code. Because it is essential to keep a 
shared library's code segment pure, and we don't know 
where it will be loaded at run time, shared libraries must 
be compiled with position independent code. This term 
means that the code must not have any dependency on 
either its own location in memory or the location of any 
data that it references. Thus, we require that all branches, 
calls, loads, and stores be either program-counter (pc) 
relative or indirect via a linkage table. The compilers 
obey these restrictions when invoked with the +z option. 
However, assembly-code programmers must be aware of 
these restrictions. 

Branches within a procedure and references to constant 
data in the code segment are implemented via pc-relative 
addressing modes. The compiler generates pc-relative code 
for procedure calls, but the linker then creates a 
special-purpose code sequence called a stub, which accesses the 
procedure linkage table. Loads and stores of variables in 
the data segment must be compiled with indirect addressing 
through the data linkage table. 

The linkage tables themselves must also be accessible in 
a position independent manner. For the PA-RISC architecture, 
we chose to use a dedicated register to point to the 
current procedure and data linkage tables (which are 
adjacent), while on the Motorola 68000 architecture, we 
use pc-relative addressing to access the linkage tables. 

Shared Library Trade-offs 

The motivation for shared libraries is that program files 
are smaller, resulting in less use of disk space, and 
library code is shared, resulting in less memory use and 
better cache and paging behavior. In addition, library 
updates automatically apply to all programs without the 
need to recompile or relink. 

However, these benefits are accompanied by costs that 
must be considered carefully. First, program startup time 
is increased because of the dynamic loading that must 
take place. Second, procedure calls to shared library 
routines are more costly because of the linkage table 
overhead. Similarly, data access within a shared library is 
slower because of the indirect addressing. Finally, library 
updates, while seeming attractive on the one hand, can be 
a cause for concern on the other, since a newly intro- 
duced bug in a library might cause existing applications 
to stop working. 

Design Goals for HP-UX Shared Libraries 

When we first began designing a shared library facility for 
the HP-UX operating system, AT&T's System V Release 3 
was the only UNIX operating system implementation of 
shared libraries. Sun Microsystems released an 
implementation in SunOS shortly afterwards.1 We also 
investigated a few other models including Multics, VAX/VMS,4 
MPE V and MPE XL,5 AIX,6 and Domain/OS.7 While 
AT&T's scheme requires static binding as well as a 
mechanism for building shared libraries, the others are all 
based on some combination of indirection and position 
independent code. 

None of the existing models offered what we considered 
to be our most important design goal — transparency. We 
felt that the behavior of shared libraries should match the 
behavior of archive libraries as closely as possible, so 
that most programmers could begin using shared libraries 
without changing anything. In addition, the behavior of 
most shared library implementations with respect to 
precedence of definitions differs dramatically from archive 
library behavior. If an entry point is defined in both 
the program and an archive library, only the definition 
from the program would be used in the program, and 
calls to that routine from the library would be bound to 
the definition in the program, not the one in the library. 
Such a situation is called preemption because the definition 
in the program preempts a definition of the same 
name in the library and all references are bound to the 
first definition. 
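
Preemption can be pictured as a first-match search over an ordered list of modules with the program at the head. The following C fragment is a minimal sketch of that rule only; the structure and function names are hypothetical and do not reflect the loader's actual data structures. 

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 4

/* Hypothetical module record: a name and the symbols it defines
   (unused slots are NULL). */
struct module { const char *name; const char *defs[MAX_SYMS]; };

/* Return the first module in search order that defines sym, or NULL.
   Because the program is searched first, its definition preempts a
   definition of the same name in any library. */
static const struct module *
bind_symbol(const struct module *const *search_list, size_t count,
            const char *sym)
{
    size_t i, j;
    assert(sym != NULL);
    for (i = 0; i < count; i++)
        for (j = 0; j < MAX_SYMS && search_list[i]->defs[j] != NULL; j++)
            if (strcmp(search_list[i]->defs[j], sym) == 0)
                return search_list[i];
    return NULL;
}
```

With a search list of { program, library } where both define the same routine, every reference to that routine, including references made from inside the library, is bound to the program's definition. 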

Another design goal that we followed was that the 
dynamic loader must be a user-level implementation. The 
only kernel support we added was a general memory- 
mapped file mechanism that we use to load shared 
libraries efficiently. The dynamic loader itself is a shared 
library that is bootstrapped into the process's address 
space by the startup code. 

We also wanted to ease the task of building shared 
libraries. We explicitly avoided any design that would 
require the programmer to modify library source code or 
follow a complicated build process. 

Finally, we recognized that, in the absence of an obvious 
standard, our shared library model should not be 
significantly different from other implementations based on 
AT&T's System V Release 4. 

PA-RISC Design Issues 

Although the HP-UX shared library implementation is 
designed to have the same external interface and behavior 
in the HP 9000 Series 300, 800, and 700 systems, 
restrictions imposed by the PA-RISC systems (Series 700 
and 800 systems) posed some interesting design 
considerations that resulted in additional complexity in the 
underlying implementation. One of the main restrictions is 
based on the PA-RISC software architecture for virtual 
memory and the lack of a facility in the operating system 
to handle the situation. 

Virtual memory in PA-RISC is structured as a set of 
address spaces each containing 2^32 bytes (see Fig. 4).8 A 
virtual address for a processor that supports a 64-bit 
address is constructed by the concatenation of the 
contents of a 32-bit register called a space register and a 
32-bit offset. The PA-RISC software architecture divides 
each space into four 1-Gbyte quadrants, with four space 
registers (sr4 to sr7) assigned to identify a quadrant (see 
Fig. 5). This scheme requires that text and data be loaded 
into separate spaces and accessed with distinct space 
pointers. Program text is accessed using sr4, shared 
library text is accessed using sr6, and all data for shared 
libraries and program files is accessed using sr5. This 
architecture does not allow contiguous mapping of text 
and data in an executable file.* Therefore, to handle 
shared libraries in PA-RISC we had to have a dedicated 
linkage table pointer register and provide support for 
interspace procedure calls and returns. 
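
The address formation described above can be written out as a short C sketch. It assumes that the two high-order bits of the 32-bit offset select the quadrant and that quadrant q is identified by space register sr(4+q); the type and function names are hypothetical. 

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the four space registers sr4..sr7. */
struct space_regs { uint32_t sr[4]; };   /* sr[0] is sr4, ..., sr[3] is sr7 */

/* Quadrant (0..3) of a 32-bit offset; each quadrant covers 2^30 bytes. */
static unsigned quadrant_of(uint32_t offset)
{
    return offset >> 30;
}

/* Assumption: quadrant q is identified by space register sr(4+q).
   The 64-bit virtual address is the space identifier concatenated
   with the 32-bit offset. */
static uint64_t virtual_address(const struct space_regs *r, uint32_t offset)
{
    uint32_t space;
    assert(r != 0);
    space = r->sr[quadrant_of(offset)];
    return ((uint64_t)space << 32) | offset;
}
```

Because text and data live in different quadrants, and therefore in different spaces, the distance between a code address and a data address is not fixed at link time, which is why a pc-relative reference from code to its linkage table is not possible here. 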

Dedicated Linkage Table Pointer. Since code and data could 
not be mapped contiguously, the linkage tables could not 
be accessed with a pc-relative code sequence generated at 
compile time. Therefore, we chose a general register 
(gr19) as a place for holding the pointer for shared 
library linkage. All position independent code and data 
references within a shared library go indirectly through 
the gr19 linkage register. Code in a main program 
accesses the linkage table directly since the main program 
code is not required to be position independent. 

Position independent code generation for shared libraries 
must always consider the gr19 linkage register as being 
live (in use), and must save and restore this register 
across procedure calls. 

The plabel. The dedicated linkage table pointer added 
complexity to the design for handling procedure labels 
and indirect procedure calls. Two items in the PA-RISC 
software architecture had to be modified to include 
information about the linkage table pointer: a function 
pointer called a plabel (procedure label), which is used in 
archive HP-UX libraries and programs, and a millicode 
routine called $$dyncall, which is used when making 
indirect function calls. To support this new plabel 
definition the following changes had to be made. 

• In programs that use shared libraries, a plabel value is the 
address of the PLT entry for the target routine, rather 
than a procedure address. An HP-UX PA-RISC shared 
library plabel is marked by setting the second-to-last 
low-order bit of the plabel (see Fig. 6). 

• The $$dyncall routine was modified to use this PLT address 
to obtain the target procedure address and the target gr19 
value. In the modified implementation, the $$dyncall routine 
and the kernel's signal-handling code check to see if 
the HP-UX shared library plabel bit is set, and if so, the 
library procedure's address and linkage table pointer 
values can be obtained using the plabel value. 
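
The plabel check can be sketched in C as follows. This is an illustration only: the two-field PLT entry layout, the function names, and the use of plain integers for addresses are assumptions made for the example, and the real $$dyncall routine is millicode rather than C. 

```c
#include <assert.h>
#include <stdint.h>

/* The second-to-last low-order bit marks a shared library plabel. */
#define PLABEL_SHLIB_BIT ((uintptr_t)0x2)

/* Hypothetical PLT entry layout: the target procedure address and the
   linkage table pointer (gr19 value) that the target expects. */
struct plt_entry { uintptr_t target; uintptr_t ltp; };

/* Build a shared library plabel from the address of a PLT entry. */
static uintptr_t make_shlib_plabel(const struct plt_entry *e)
{
    /* Entries are word aligned, so the two low-order bits are free. */
    assert(((uintptr_t)e & 0x3) == 0);
    return (uintptr_t)e | PLABEL_SHLIB_BIT;
}

static int is_shlib_plabel(uintptr_t plabel)
{
    return (plabel & PLABEL_SHLIB_BIT) != 0;
}

/* Toy stand-in for the check made on an indirect call: return the
   target address and, for a shared library plabel, also set *ltp. */
static uintptr_t resolve_plabel(uintptr_t plabel, uintptr_t *ltp)
{
    if (is_shlib_plabel(plabel)) {
        const struct plt_entry *e =
            (const struct plt_entry *)(plabel & ~PLABEL_SHLIB_BIT);
        *ltp = e->ltp;
        return e->target;
    }
    return plabel;   /* ordinary plabel: the value is the address itself */
}
```

An ordinary plabel passes through unchanged, so programs that do not use shared libraries see the same behavior as before. 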



* The HP 9000 Series 300 systems do support contiguously mapped text and data. 
Interspace Calls and Returns. The second significant impact 
on the shared library design was the need for a way to 
handle interspace calls and returns because in the PA-RISC 
software architecture, program text and shared 
library text are mapped into separate spaces. 

The default procedure call sequence generated by the 
HP-UX compilers consists of intraspace branches (BL 
instruction) and returns (BV instruction). The compilers 
assume that all of a program's text is in the same virtual 
space. To perform interspace branches, an interspace call 
and return sequence is required. The call and return 
sequence for an interspace branch is further complicated 
by the fact that the target space is not known at compile 
time, so a simple interspace branch instruction (BLE 
offset(srX,base)) is not sufficient. Instead, a code sequence 
that loads the target space into a space register and then 
performs an interspace branch is required. 

The HP-UX memory map implementation mmap() is used 
for mapping shared library text. As mentioned earlier, all 
shared library text is mapped into the sr6 space (quad 3 
addresses) and all data is mapped into the sr5 space 
(quad 2 addresses). This mapping, along with the need to 
have a dedicated position independent code linkage 
register, requires special code sequences to be produced 
for each function in the library. These code sequences are 
referred to as stubs. The linker places stubs into the 
routine making the call and in the library routines (and 
program files) being called to handle saving and restoring 
the gr19 linkage register and performing the interspace 
branch (see Fig. 7). As mentioned above, compilers 
generate an intraspace branch (BL) and an intraspace 
return (BV) for procedure call sequences. The linker 
patches the BL to jump to the import stub code (see Fig. 
7), which then performs the interspace branch to the 
target routine's export stub. The export stub 
is used to trap the return from the call, restore the 
original return pointer, and execute an interspace branch. 

Fig. 5. The relationship of space registers sr4, sr5, sr6, and sr7 
to the virtual address spaces. 

HP-UX User Interface 

The HP-UX shared library design offers various user 
interface routines that provide capabilities to dynamically 
load and unload libraries, to define symbols, and to 
obtain information about loaded libraries and symbol 
values. All of these user interface routines are designed 
to be used by a user-level program to control the runtime 
loading and binding of shared libraries. 

Library Loading. Shared libraries can be loaded either 
programmatically (explicit loading) or via arguments in 
the link command line (implicit loading). Explicit loading 
and unloading are provided through the shl_load() and 
shl_unload() routines. Libraries specified for implicit loading 
are mapped into memory at program startup. There are 
two main binding modes for loading shared libraries: 
immediate binding and deferred binding. For implicit 
loading the binding modes can be specified on the link 
command line using the -B immediate or -B deferred linker 
command line options. The default mode for implicit 
shared libraries is deferred binding. For explicit loading 
the binding mode is specified by using the BIND_IMMEDIATE 
or BIND_DEFERRED flag in shl_load()'s argument list. 

The deferred binding mode will bind a code symbol when 
the symbol is first referenced, and will bind all visible 
data symbols on program startup. The data symbols must 
be bound when the library is loaded since there is no 
mechanism for trapping a data reference in PA-RISC 
(see "Deferred Binding, Relocation, and Initialization of 
Shared Library Data" on page 52 for more details). The 
deferred binding mode spreads the symbol-binding time 
across the life of the program, and will only bind 
procedures that are explicitly referenced. 

Fig. 7. Shared library procedure calls. 

The immediate binding mode binds all data and code 
symbols at program startup. If there are any unresolved 
symbols, a fatal error is reported and loading is not 
completed. The program will abort if unresolved symbols 
are detected while loading any implicitly loaded libraries 
when immediate binding is used. The immediate binding 
mode forces all of the symbol binding to be done on 
startup, so the binding cost is paid only at startup. The 
immediate binding mode also has the advantage of 
determining unresolved symbol references at startup. (In 
deferred binding mode the program could be running for 
some time before an unresolved symbol error is detected.) 

Additional flags are available for explicitly loading shared 
libraries that alter the behavior of the immediate and 
deferred binding modes. These flags provide the user with 
some control over the binding time and binding order of 
shared library symbols, and are used in conjunction with 
the BIND_IMMEDIATE and BIND_DEFERRED flags. 



• BIND_FIRST. This option specifies that the library should be 
placed at the head of the library search list before the 
program file. The default is to have the program file at the 
head of the search list and to place additional shared 
libraries at the end of the search list. All library searching 
is done from left (head) to right (tail). 

• BIND_NONFATAL. When used with the BIND_IMMEDIATE flag, 
this flag specifies that if a code symbol is not found at 
startup time, then the binding is deferred until that code 
symbol is referenced (this implies that all unresolved 
code symbols will be marked deferred with no error or 
warning given, and all unresolved data symbols will 
produce an error). The default immediate binding behavior is 
to abort if a symbol cannot be resolved. This option 
allows users to force all possible binding to be done at 
startup, while allowing the program file to reference 
symbols that may be undefined at startup but defined later in 
the execution of the program. 

• BIND_NOSTART. This flag specifies that the shared library 
initializer routine should not be called when the library is 
loaded or unloaded. The initializer routine is specified 
using the +I linker option when the shared library is built. 
Default behavior is for the dynamic loader to call the 
initializer routine, if defined, when the shared library is 
loaded. 
Deferred Binding, Relocation, and Initialization of Shared Library Data 



In most shared library implementations, including the HP-UX implementation, a 
shared library can be loaded at any address at run time. While position independent 
code is used for library text, library data must be relocated at run time after a 
load address has been assigned to the library. Also, because addresses of symbols 
defined by shared libraries are not known until run time, references from application 
programs to shared libraries cannot be bound to correct virtual addresses at 
link time, nor can references between shared libraries or from shared libraries to 
application programs be resolved at library build time. Instead, all such references 
are statically resolved to linkage tables. Each entry in a linkage table corresponds 
to a specific symbol. When a program that uses shared libraries is executed, the 
loader must initialize each linkage table entry with the address of the corresponding 
symbol. 

Furthermore, languages such as C++ support run-time initialization of data. A 
class can have a constructor, which is a function defined to initialize objects of 
that class. The constructor is executed when an object of that class is created. 
C++ mandates that the constructors for nonlocal static objects in a translation 
unit* be executed before the first use of any function or object defined in that 
module. Other languages, such as Ada, may have similar run-time initialization 
requirements. 

The dynamic loader must therefore perform relocation, binding, and initialization at 
run time. Linkage table entries for function calls can be initialized to trap into the 
dynamic loader, so that the binding of a function reference can be deferred until 
the first call through the reference. On the other hand, data references cannot be 
trapped in this manner on most architectures. Thus, in most shared library 
implementations, the dynamic loader must perform all relocation, binding, and 
initialization of data for an entire library when that library is loaded. This normally implies a 
high startup cost for programs that use shared libraries. 

Module Tables 

The HP-UX design conceptually maintains some of the boundaries between the 
modules that make up a shared library. All export table entries, linkage table 
entries, relocation records, and constructors are grouped by translation unit into 
module tables. The dynamic loader defers the binding, relocation, and initialization 
of data for a module until the first potential access of any symbol defined in that 
module. This greatly reduces the startup overhead of programs that use shared 
libraries. 

Since the Series 700 architecture does not support trapping on specific data 
references, the dynamic loader cannot directly detect the first access of a given data 
symbol. Instead, the dynamic loader considers a given data symbol to be potentially 
accessed on the first call to any function that references the symbol. Rather than 
actually keeping track of which functions reference which data symbols, the 
module table allows the dynamic loader to make a further approximation. On the first 
call to a given function, the dynamic loader considers the whole module to have 
been potentially accessed. It consults the module table to determine which linkage 
table entries to bind, which relocation records to apply, and which constructors to 
execute. 

* A static object is an object that lives throughout the life of the program, and a translation 
unit is the source file produced after going through the C++ preprocessor. 

This algorithm is recursive, since binding linkage table entries, relocating data, and 
executing constructors all may reference symbols in other modules. These modules 
must also be considered to be potentially accessed. The dynamic loader must 
therefore bind, relocate, and initialize data in those modules as well. If libraries 
typically contain long chains of data references between modules, then this 
algorithm will be processing data for many modules on the first call to a given library 
function. If the library is completely connected by such references, this algorithm 
degenerates into binding, relocating, and initializing all data for an entire library 
the first time any function in that library is called. However, our experience shows 
that typical libraries seldom have chains more than three or four modules long, and 
many programs access only a fraction of the total number of modules in a library. 
Deferring the binding, relocation, and initialization of data on a module basis has 
shown that the time spent performing these tasks can be reduced by 50% to 30%, 
depending on the program and libraries involved. 
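
The recursive, module-at-a-time processing described above might be modeled as in the following C sketch. The module record, the fixed-size dependency array, and the function name are hypothetical; the actual binding, relocation, and constructor work is reduced to setting a flag. 

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS 4

/* Hypothetical module table entry. deps[] holds the indices of other
   modules that this module's data refers to; -1 ends the list. */
struct module {
    const char *name;
    int deps[MAX_DEPS];
    int processed;        /* nonzero once bound, relocated, initialized */
};

/* Process module m and every module reachable from it through data
   references. Returns the number of modules newly processed. */
static int process_module(struct module *mods, size_t count, int m)
{
    int done = 1, i;
    assert(m >= 0 && (size_t)m < count);
    if (mods[m].processed)
        return 0;
    mods[m].processed = 1;   /* set first so reference cycles terminate */
    /* A real loader would bind linkage table entries, apply relocation
       records, and run constructors for module m at this point. */
    for (i = 0; i < MAX_DEPS && mods[m].deps[i] >= 0; i++)
        done += process_module(mods, count, mods[m].deps[i]);
    return done;
}
```

The first call to a function in some module would trigger process_module for that module; modules not reachable from it stay untouched until one of their own functions is called. 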

Further C++ Considerations 

The C++ definition of static destructors adds another complication to the design. A 
destructor for an object is executed when the object is destroyed. Static objects 
are considered destroyed when the program terminates. C++ mandates that 
destructors for static objects be called in reverse order from the constructors. Other 
languages may have different semantics. Therefore, the dynamic loader employs a 
more general technique. Rather than execute constructors directly when processing 
data for a module, the dynamic loader executes a function called an elaborator, 
which is defined by the C++ run-time support code. The C++ elaborator executes 
all static constructors for the module and also inserts any corresponding 
destructors at the head of a linked list. On program termination, the C++ run-time support 
code traverses this list and executes all destructors. 
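
Inserting each destructor at the head of a linked list is what yields the reverse order on traversal. The C fragment below is a minimal sketch of that mechanism; the names are hypothetical, and the three demo destructors only record the order in which they run. 

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*dtor_t)(void);

struct dtor_node { dtor_t fn; struct dtor_node *next; };
static struct dtor_node *dtor_list = NULL;

/* Called by the (hypothetical) elaborator after a constructor runs:
   insert the matching destructor at the head of the list. */
static void register_dtor(dtor_t fn)
{
    struct dtor_node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->fn = fn;
    n->next = dtor_list;
    dtor_list = n;
}

/* Called at program termination: walking from the head runs the
   destructors in reverse order of registration. */
static void run_dtors(void)
{
    while (dtor_list != NULL) {
        struct dtor_node *n = dtor_list;
        dtor_list = n->next;
        n->fn();
        free(n);
    }
}

/* Demo destructors that record the order in which they run. */
static char trace[8];
static int trace_len = 0;
static void dtor_a(void) { trace[trace_len++] = 'a'; }
static void dtor_b(void) { trace[trace_len++] = 'b'; }
static void dtor_c(void) { trace[trace_len++] = 'c'; }
```

Registering dtor_a, dtor_b, and dtor_c in that order and then calling run_dtors() runs them as c, b, a. 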

The HP-UX shared library design also supports explicit loading and unloading of 
shared libraries from within a program via the shl_load and shl_unload functions 
described in the accompanying article on page 50. While C++ does not define any 
particular semantics for dynamic loading and unloading of libraries, it seems 
natural to execute static destructors for objects defined in an explicitly loaded library 
when the library is unloaded. Since the destructors for objects defined in a library 
are often defined in the library itself, the dynamic loader clearly cannot wait until 
program termination to execute destructors for objects in libraries that have 
already been unloaded. Therefore, the dynamic loader invokes a library termination 
function when a library is unloaded. This function, also defined by the C++ 
run-time support system, traverses the linked list of destructors and executes all 
destructors for the library being unloaded. It then removes those destructors from the 
list. For symmetry, the dynamic loader also invokes an initialization function when 
a library is loaded, implicitly or explicitly, but this capability is not used by the C++ 
implementation. 

Marc Sabatella 

Software Development Engineer 

Systems Technology Division 



• BIND_VERBOSE. This flag causes messages to be emitted 
when unresolved symbols are discovered. Default behav- 
ior in the immediate bind mode performs the library load 
and bind silently and returns error status through the 
return value and errno variable. 

Other user interface routines are provided for obtaining 
information about libraries that have already been loaded. 

• shl_get(). This routine returns information about currently 
loaded libraries, including those loaded implicitly at 
startup time. The library is specified by the index, or ordinal 
position of the shared library in the shared library search 
list. The information returned includes the library handle, 
pathname, initializer address, text start address, text end 
address, data start address, and data end address. 

• shl_gethandle(). This routine returns the same information 
as the shl_get() routine, but the user specifies the library of 
interest by the library handle rather than the search-order 
index. Typically, the shl_get() routine would be used when 
a user wants to traverse through the list of libraries in 
search order, and the shl_gethandle() routine can be used 
to get information about a specific library for which the 
library handle is known (i.e., explicitly loaded libraries). 
Dynamic Symbol Management. User interface routines 
provided for dynamic symbol management include 
shl_findsym() and shl_definesym(). The shl_findsym() routine is used to 
obtain the addresses of dynamically loaded symbols so 
that they can be called. The shl_findsym() interface is the 
only supported way of calling dynamically loaded routines 
and obtaining addresses for dynamically loaded data 
items. The shl_definesym() routine allows the user to 
dynamically define a symbol that is to be used in future symbol 
resolutions. The user provides the symbol name, type, and 
value. If the value of the symbol falls within range of a 
library that has previously been loaded, then the newly 
defined symbol is associated with that library and will be 
removed when the associated library is unloaded. 

Other Features 

Other features that are specific to the HP-UX shared 
library implementation include a module-level approach to 
version control, a method of symbol binding that reduces 
the shared library startup cost for programs that use a 
small percentage of the library routines, and special C++ 
support. The special C++ support is described in the 
short article on the previous page. 

Version control. One of the advantages of shared libraries 
is that when a change (e.g., a defect repair) is made to 
the library, all users can take immediate advantage of the 
change without rebuilding their program files. This can 
also be a disadvantage if the changes are not applied 
carefully, or if the change makes a routine incompatible 
with previous versions of that routine. To protect users of 
shared libraries some type of version control must be 
provided. 

The HP-UX shared library approach to version control is 
provided at the compilation unit (module) level, which is 
unlike most existing implementations that provide version 
control only at the library level. Our version control 
scheme is based on library marks that are used to 
identify incompatible changes. When an incompatible change is 
made to a routine, the library developer date-stamps the 
routine using a compiler source directive. The date is 
used as the version number and is associated with all 
symbols exported from that module. The resulting module 
can then be compiled and added to the shared library 
along with the previous versions of that module. Thus, 
the date stamp is used as a library mark that reflects the 
version of the library routine. When a user program file is 
built, the mark of each library linked with the program is 
recorded in the program file. When the program is run, 
the dynamic loader uses the mark recorded in the 
program file to determine which shared library symbol is 
used for binding. The dynamic loader will not accept any 
symbol definitions that have a mark higher than the mark 
recorded for the defining library in the program file. 
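
The acceptance rule for marks can be sketched in C. The text only states that definitions with a mark higher than the recorded one are rejected; choosing the newest definition among those that remain is an assumption made for this example, as are the names and the encoding of a date stamp as a plain number. 

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical exported definition: symbol name, mark (a date stamp
   encoded here as YYYYMM), and an id used only to tell versions apart. */
struct versioned_sym { const char *name; long mark; int id; };

/* Among the definitions of name, reject any whose mark is higher than
   recorded_mark, and return the one with the highest remaining mark
   (an assumption of this sketch). Returns NULL if none qualifies. */
static const struct versioned_sym *
select_version(const struct versioned_sym *syms, size_t count,
               const char *name, long recorded_mark)
{
    const struct versioned_sym *best = NULL;
    size_t i;
    assert(name != NULL);
    for (i = 0; i < count; i++) {
        if (strcmp(syms[i].name, name) != 0)
            continue;
        if (syms[i].mark > recorded_mark)
            continue;                 /* too new for this program */
        if (best == NULL || syms[i].mark > best->mark)
            best = &syms[i];
    }
    return best;
}
```

A program built before an incompatible change keeps binding to the older definition, while a program built afterwards records the higher mark and picks up the new one. 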

This scheme can also be used for changes that are 
backwards compatible and for programs that rely on new 
behavior. In this case, library developers would include a 
dummy routine with a new date to force an increase in 
the library's mark. Any new programs linked with this 
library would have the new mark recorded, and if run on 
a system with an older version of the library, the dynamic 
loader will refuse to load the old library because the 
version number of the installed library would be lower 
than the number recorded in the program file. 

Archive Symbol Binding. Typically, a shared library is 
treated as one complete unit, and all symbols within the 
library are bound when any symbol in that library is 
referenced. In the HP-UX scheme, the shared library file 
maintains module granularity similar to archive libraries. 
When the shared library is built, a data structure within 
the shared library is used to maintain the list of modules 
(compilation units) used to build the library. The list of 
defined symbols and referenced symbols is maintained for 
each module. During symbol resolution, the dynamic 
loader binds only symbols for modules that have been 
referenced. This symbol binding technique provides a 
significant performance improvement in the startup and 
symbol binding time for typical programs (i.e., programs 
that reference a relatively low percentage of the total 
symbols in the attached shared libraries). 
Integrating an Electronic Dictionary 
into a Natural Language Processing 
System 

This paper discusses the types of electronic dictionaries available and the 
trends in electronic dictionary technology, and provides detailed discussion 
of particular dictionaries. It describes the incorporation of one of these 
electronic dictionaries into Hewlett-Packard's natural language 
understanding system and discusses various computer applications that 
could use the technology now available. 

by Diana C. Roberts 



Computational linguistics is demonstrating its relevance to 
commercial concerns. During the past few years, not only 
have companies funded and carried out research projects 
in computational linguistics, but also several products 
based on linguistic technology have emerged on the 
market. Franklin Products has created a line of handheld 
calculator-like dictionaries which range from a spelling 
dictionary to a pronouncing dictionary with a speech 
generator attached to a full thesaurus. Franklin Products, 
Texas Instruments, Casio, and Seiko all produce multilingual 
handheld translating dictionaries. Many text editors 
and word processors provide spelling checkers and 
thesauruses, such as those used by WordPerfect. Grammatik 
IV and Grammatik Mac are widely available style 
and grammar checkers. Merriam-Webster and the Oxford 
University Press have recently released their dictionaries 
on CD-ROM. 

Both the commercial success of these linguistics products 
and the promising nature of their underlying theoretical 
basis encourage more ambitious work in industrial 
research. Outside of the United States, particularly in 
Europe and Japan, there is great interest in machine 
translation, although products remain on the research 
level. The Toshiba Corporation has developed a Japanese-English
typed-input translating system for conversational
language. Within the United States, Unisys, SRI, and
Hewlett-Packard† have developed natural language understanding
systems with prospective applications of database
inquiry and equipment control, among other areas.
In the area of electronic dictionary development, both the
Centre for Lexical Information (CELEX) in the Netherlands
and Oxford University Press (publishers of the
Oxford English Dictionary) in England are developing
dictionary products that are sophisticated both in the 
linguistic data they contain and in the way the data is 
accessed. 



† Hewlett-Packard's HP-NL (Hewlett-Packard Natural Language) system was under development
from 1982 to 1991.



The linguistics of computational linguistic theory is based 
on standard modern theories such as lexical functional 
grammar, or LFG, and head-driven phrase-structure
grammar, or HPSG. Most of these theories assume the
word to be the basic linguistic element in phrase formation;
that is, they are "lexicalized" theories. Words,
therefore, are specified in great linguistic detail, and
syntactic analysis of sentences is based on the interplay
of the characteristics of the component words of a 
sentence. Therefore, products based on linguistic theory 
such as grammar checkers and natural language under- 
standing systems require dictionaries containing detailed 
descriptions of words. Products that do not involve 
sentential or phrasal analysis, such as spelling checkers 
and word analyzers, also require extensive dictionaries. 
Thus, dictionaries are very important components of most 
computational linguistic products. 

Of course, the book dictionary has been a standard
literary tool, and the widespread acceptance of the
computer as a nontechnical tool is creating an emerging
demand for standard dictionaries in electronic form. In
fact, the importance of linguistically extensive dictionaries
to computational linguistic projects and products is
reflected in the emerging availability of electronic dictionaries
during the past few years. Webster's Ninth on
CD-ROM, a traditional type of dictionary now in electronic
form, became available from Merriam-Webster in 1990.
Longman House made the typesetting tape for its Longman's
Dictionary of Contemporary English (LDOCE)



Notation and Conventions

In this article, italic type is used for natural language words cited in the text.

The sans-serif font is used for programming keywords.

The asterisk (*) preceding a phrase indicates ungrammaticality.

The dagger (†) indicates a footnote.

available to the research community of linguists, lexicographers,
and data access specialists in the mid-1980s. It is
still a research tool but has become very expensive to 
commercial clients, presumably reflecting its value to the 
research and commercial community. The products from 
CELEX and the Oxford University Press mentioned above 
are among the most sophisticated electronic dictionary 
products available today. The second edition of the 
Oxford English Dictionary, the OED2, is available on 
CD-ROM. Its data, in running text form, is marked explicitly
in SGML (the Standard Generalized Markup Language,
ISO 8879) for retrieval. The CELEX dictionary in English,
Dutch, and German is a relational database of words and 
their linguistic behavior. These two latter products are 
sophisticated in very different ways and are designed for 
very different uses, but they have in common a great 
flexibility and specificity in data retrieval. 

Related work that can support further lexicographical 
development in the future is being carried on by the Data 
Collection Initiative (DCI) through the Association for
Computational Linguistics. This and similar initiatives are
intended to collect data of various forms, particularly
literary, for computer storage and access. Lexicographers
are already making use of large corpora in determining
the coverage for their dictionaries.4 The availability of
large corpora for statistical studies will certainly aid and 
may also revolutionize lexicographical work and therefore 
the nature of electronic dictionaries. 

There is apparently a commercial market for linguistically 
based products, since such products are already being
sold. Many of these current products either rely on
electronic dictionaries or are themselves electronic
dictionaries of some kind. Recent years have seen electronic
dictionaries become more sophisticated, both in
their content and in the accessibility of their data. Because
many sophisticated products based on computational
linguistics must rely on dictionary information, the
potential scope of computational systems based on 
linguistics has increased with the improvements in 
electronic dictionaries. 

My aim with this paper is to introduce the area of electronic
dictionary technology, to suggest areas of research
leading to product development that crucially exploit the
emerging dictionary technologies, and to report on the
results of one such effort at Hewlett-Packard Laboratories.

What Is a Dictionary? 

Commonly, a dictionary is considered to be a listing by
spelling of common natural language words, arranged
alphabetically, typically with pronunciation and meaning
information. There are, however, collections of words that
violate one or more of these three stereotypical characteristics
but are still considered dictionaries. For instance,
the simplest kind of dictionary is the word list, used for
checking spelling; it contains no additional word information.
The Bildwörterbuch from Duden contains both
pictures and words for each entry, and is arranged not
alphabetically but by topic. Stedman's Medical Dictionary
contains alphabetically ordered technical terms from the
domain of medicine and their definitions rather than
common English words. It also contains some etymological
information, but offers pronunciation information for



only some entries. Similarly, symbol tables of compilers 
contain symbols used by software programs; their entries 
are not natural language words. Data dictionaries of 
database management systems also contain entries for 
non-natural-language words, as well as other nonstandard 
dictionary information such as computer programs. 

If all three of the stereotypical characteristics can be
violated, then for the purposes of this paper we need to
establish what a dictionary is. As a start, we can appeal
to a dictionary as an authority on itself. Webster's Ninth
New Collegiate Dictionary (the electronic version of
which is one of the dictionaries discussed in this paper)
says that a dictionary is "1: a reference book containing
words usu. alphabetically arranged along with information
about their forms, pronunciations, functions, etymologies,
meanings, and syntactical and idiomatic uses 2: a refer- 
ence book listing alphabetically terms or names important 
to a particular subject or activity along with discussion of 
their meanings and applications 3: a reference book 
giving for words of one language equivalents in another 4: 
a list (as of phrases, synonyms, or hyphenation instruc- 
tions) stored in machine-readable form (as on a disk) for 
reference by an automatic system (as for information 
retrieval or computerized typesetting)." 

There are some common elements of these definitions, 
which together form the defining characteristics of the 
dictionary. First and most crucial, the dictionary is a 
listing of language elements, commonly words. Implied 
too is that the entries can be taken from any domain.
These entries are arranged in some way to make retrieval
either possible or easy. And finally, the dictionary also 
often contains other information associated with the 
entry. An electronic dictionary is any kind of dictionary in 
machine-readable form. 

The electronic dictionaries available now vary greatly. 
This paper will only consider dictionaries whose entries 
come from the domain of natural language, and whose 
entries are words rather than phrases. I will discuss three 
dimensions along which electronic dictionaries differ from 
each other: type of additional information about the entry 
presented, the explicitness of the information categories
(more explicit marking of the categories reducing ambiguity),
and the accessibility and organization of the data.
After the discussion of electronic dictionaries and their
characteristics, I will discuss the possible uses of electronic
dictionaries and the necessary characteristics of
the dictionaries for the various possible uses. The purposes
to which an electronic dictionary can be put
depend on its characteristics in each of the three
dimensions discussed in the following sections.

Evolution of Electronic Dictionaries 

Early electronic dictionaries were word lists. They had a 
limited range of use because they contained limited types 
of information in a simple organization. Electronic dictio- 
naries are becoming more complex and more flexible 
now, as they become potentially more useful in domains 
that did not exist before. The potential uses are shaping 
the ways in which electronic dictionaries are evolving. 

As computers began to be used for writing and communication,
the standard desk reference book, the dictionary,
was ported to electronic form. When this reference tool
became available in the same medium as the word 
processor — the computer — the lexicographical information 
was now machine-readable. As linguistically based soft- 
ware systems matured, the demand for accessible lexico- 
graphical data based on modern linguistic theory grew. 
This demand has brought several important pressures on 
electronic dictionaries. 

Lexicographical Information. First, several newer electronic 
dictionaries provide extensive linguistic information based 
in some cases on modern linguistic theories. Word entries 
in traditional dictionaries often do not recognize the same 
categories that are important to generative linguistic 
theory. Traditional dictionaries focus on defining words, 
providing historical derivation information (etymologies),
providing sample sentences to illustrate word use, and
providing some basic linguistic information, such as
spelling, syllabification, pronunciation, part of speech,
idiomatic use, and semantically related words (synonyms,
hyponyms, hypernyms, and antonyms).† This information,
if it were unambiguously accessible, could be used for
some software applications (for example, a spelling
checker, a semantic net, or possibly speech generation or
recognition, depending on the sophistication of the speech
system). However, this information is insufficient for
applications that require complex word and/or sentence
analysis, such as natural language processing, which
involves natural language understanding.

The recent expansion of electronic dictionaries coincided, 
not surprisingly, with the emergence of several book 
dictionaries of English that carefully detail linguistic word 
information based on modern linguistic theory. These
"learning dictionaries," created for foreign learners of
English rather than native speakers, contain only the
most common words instead of attempting to be exhaustive.
These dictionaries concentrate more on complete
syntactic and morphological characterization of their
entries than on exhaustive meaning explanations and
citations, and use linguistic categories from modern
generative linguistics in preference to traditional categories.
Three of these dictionaries are Longman's Dictionary
of Contemporary English (LDOCE), the Oxford Advanced
Learner's Dictionary of Current English (OALD), and
Collins COBUILD English Language Dictionary. Some of
the most useful electronic dictionaries draw their
lexicographical information from these sources.††

The following are some of the kinds of information found 
in electronic dictionaries, both traditional and modern:

• Orthography (spelling) 

• Syllabification 

• Phonology (pronunciation) 

• Linguistic information about the word's properties, in- 
cluding syntax, semantics (meaning), and morphology 
(word structure) 

• Related word(s), related by morphology, either inflectional
(work, works) or derivational (happy, unhappy)

† In a pair of words, one of which has a broader meaning than the other, the word with the
broader meaning is the hypernym and the word with the more narrow meaning is the hyponym.
For example, for the words book and novel, book would be the hypernym and novel the hyponym.
†† Extensive and explicit phonetic, morphological, and syntactic information is useful now in
computer applications, whereas neither semantic nor etymological information is yet structured
enough to be useful.



• Synonym listings 

• Semantic hierarchies 

• Frequency of occurrence 

• Meaning (not yet a robustly structured field) 

• Etymology 

• Usage (sample phrases and/or sentences, either created 
by the lexicographer or cited from texts). 

Data Categorization. A second trend in newer electronic 
dictionaries is to represent the lexicographical data in such 
a way that it is unambiguously categorized, either tagged
in the case of running text dictionaries, or stored in a 
database. Linguistically based software systems must be 
able to access lexicographical information unambiguously. 

Traditional dictionaries rely on human interpretation of 
various type faces which are often formally ambiguous to 
determine the category of information represented. In the 
entry for "dictionary" in Webster's Ninth, the italic type
face is used to represent both the part-of-speech and
foreign-language etymological information, and the part-of-speech
indicator comes after the entry for "dictionary"
and before the entry for the plural "-naries". 

One of the earlier desk-type dictionaries in electronic 
form was the Longman's Dictionary of Contemporary 
English. The tape containing the typesetting information 
for the book form was stored electronically. Thus, all its 
lexicographical information was available electronically, 
but the data fields were ambiguously indicated through 
the typesetting commands. 

The second edition of the Oxford English Dictionary 
(OED2) is available in an electronic edition on CD-ROM.
This dictionary, like the LDOCE, is a running text dictionary
rather than a regular database. Its data fields,
however, are explicitly marked using the SGML tagging
language. Here, data retrieval does not face the problem 
of ambiguity. 

The CELEX electronic dictionary is in relational database 
form. This encourages uniformity in the classification 
system and in the data. 

Accessibility of Data. A third trend is the increased acces- 
sibility to their data offered by some electronic dictio- 
naries. Accessibility is affected by both data structure and 
data encryption. In some dictionaries, the entry point for 
retrieval is only by word spelling; in others, there are 
multiple entry points. The data of some dictionaries is 
designed intentionally to be inaccessible programmatically;
in other cases, it is designed to be accessible. 

In word lists such as spelling dictionaries, data organization
is not extensive, usually consisting of only alphabetical
ordering and indexing to allow fast access. The data
in these dictionaries is fully accessible to at least the
spelling software application, and may be designed in a
way that it could be used by other software applications
as well. For instance, the ispell dictionary is a word list in
ASCII format that can be used for other purposes. 

Other dictionary types vary more in their accessibility,
both in whether the data is accessible at all programmatically,
and in how flexible the access is. The data in the
OED editions and in Webster's Ninth is encrypted (an
unencrypted version of the OED2 is available, but expensive).
In both the LDOCE and the CELEX dictionaries, the
data is available programmatically.

1st
2nd
as
AAA
aardvark
aardvarks
aardwolf
aardwolves
Aarhus
Aaron
ABA
Ababa
abacterial
abacus
abacuses

Fig. 1. The first few entries in the dictionary used by the spelling
checker ispell.


In Webster's Ninth, the user can access data only through
spelling. In the OED, the user can access data through
spelling, quotation year, original language, and many other
attributes. In the CELEX dictionary, which is a relational
database, access is completely flexible. And in LDOCE, 
the user can access the data through selected fields in 
the data entries. 
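
The contrast between access by spelling only and the flexible access of a relational dictionary can be sketched with an in-memory SQLite table. The table layout, the column names, and every value below are assumptions made for illustration; they are not the actual CELEX schema or data.

```python
import sqlite3

def build_demo_db():
    """Create a tiny, hypothetical word-form table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wordforms "
        "(spelling TEXT, lemma TEXT, pos TEXT, frequency INTEGER)"
    )
    conn.executemany(
        "INSERT INTO wordforms VALUES (?, ?, ?, ?)",
        [
            ("abandon", "abandon", "verb", 40),      # invented values
            ("abandoned", "abandon", "verb", 60),
            ("abandonment", "abandonment", "noun", 10),
            ("abacus", "abacus", "noun", 2),
        ],
    )
    return conn

def by_spelling(conn, spelling):
    """Single entry point: look a word up by its spelling."""
    rows = conn.execute(
        "SELECT lemma, pos FROM wordforms WHERE spelling = ?", (spelling,)
    )
    return rows.fetchall()

def frequent_forms(conn, pos, min_frequency):
    """Another entry point: select by part of speech and frequency."""
    rows = conn.execute(
        "SELECT spelling FROM wordforms "
        "WHERE pos = ? AND frequency >= ? ORDER BY spelling",
        (pos, min_frequency),
    )
    return [row[0] for row in rows]
```

A dictionary that offers only the first kind of query corresponds to spelling-only access; a relational form allows any column, or combination of columns, to serve as the entry point.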

Coverage. A fourth trend in modern electronic dictionaries
is that, as linguistically based software systems become
larger, the need grows for representing all and only those
words relevant to a particular application in the electronic
dictionary. All relevant words must be represented to
allow for complete coverage. It is desirable also to
represent only relevant words to reduce storage and
retrieval costs. At least one of the learning dictionaries
discussed above, Collins COBUILD dictionary, chose its
selection of entries not by the traditional approaches of
introspection and of searching for unusual word uses, but
rather by amassing a corpus of text and entering word
occurrences in that corpus into its dictionary. This should
result in a dictionary with a vocabulary representative of
the domain in which the corpus was gathered rather than
an idiosyncratically collected vocabulary. This approach
yields an additional desirable characteristic: statistical
studies on the corpus can indicate frequency of word use,
which can be used in ordering both linguistic uses and
meaning uses of the same word. This frequency information
could be useful to software applications. 

Sample Dictionaries 

Word lists contain word spellings, sometimes accompanied 
by syllabification, pronunciation, or frequency information.
An example is the dictionary used by the spelling checker
ispell. Fig. 1 shows the first few entries in this dictionary,
which contains word entries by spelling only. 

Another well-known word list electronic dictionary is the 
Brown corpus, which was constructed from statistical 
studies on linguistic texts. It provides spelling and part-of-speech
information. However, its great contribution is its
frequency information, which records the frequency of 



different words with the same spelling. Frequency lists 
usually collapse word frequencies to occurrences of a 
spelling instead of accounting for homonyms. 
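
The difference between the two kinds of frequency list can be made concrete with a small sketch. The tagged tokens below are invented for illustration and are not taken from the Brown corpus; the point is only that counting (spelling, part of speech) pairs keeps homonyms apart, while counting spellings alone collapses them.

```python
from collections import Counter

# Hypothetical tagged corpus: (spelling, part of speech) pairs.
TAGGED_TOKENS = [
    ("work", "verb"), ("work", "noun"), ("work", "verb"),
    ("abandon", "verb"), ("abandon", "noun"), ("work", "noun"),
    ("work", "verb"),
]

def frequency_by_spelling(tokens):
    """Collapse all words that share a spelling into one count."""
    return Counter(spelling for spelling, _pos in tokens)

def frequency_by_word(tokens):
    """Keep separate counts for different words with the same spelling."""
    return Counter(tokens)
```

With these tokens, the spelling work has a single count of five in the first list, but the second list separates the verb from the noun.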

Other electronic dictionaries contain more extensive data 
than do word lists. The desktop dictionary Webster's 
Ninth on CD-ROM, for instance, provides spelling, syllabification,
pronunciation, meaning, part of speech, and
some etymological information for each word.

The Oxford English Dictionary on CD-ROM desktop 
dictionary also provides extensive information. Its data 
includes spelling, etymology (parent language), part of 
speech, quotations to demonstrate context, year of 
quotation, and meaning. 

The data in the machine-readable Longman's Dictionary
of Contemporary English (LDOCE) and in the CELEX
lexical database from the Centre for Lexical Information
is also extensive and includes phonology, syllabification,
part of speech, morphology, specification of the arguments
that occur with the word in a phrase, such as
subject, object, and so on (subcategorization), and in the
CELEX dictionary, frequency. Also, while the desktop type
of electronic dictionary does not typically contain linguistic
information that coincides with modern linguistic
theories, these two electronic dictionaries contain word
categorizations based on modern linguistic theory. Much
of the CELEX dictionary's syntactic data is based on 
categories from the LDOCE and the OALD. 

Fig. 2 shows examples from the CELEX electronic dictionary.
In these examples, the lemma is the root word, the
part of speech is the major word class to which the word
belongs, the morphology is the formula for deriving the
spelling of the inflected word from the lemma's spelling,
morphological information includes morphological characteristics
of the word (singular, comparative, etc.), and flection type
contains the same morphological information compressed 
to one column. 
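
A morphology formula of the kind described above can be applied mechanically. The sketch below assumes a simplified notation in which the formula is either empty (the word form is the lemma itself) or a plus sign followed by a suffix, as in the +ed and +ing entries of Fig. 2. The function name is hypothetical, and the full CELEX notation is likely richer than this.

```python
def inflect(lemma: str, formula: str) -> str:
    """Derive an inflected spelling from a lemma and a '+suffix' formula.

    An empty formula means the inflected form equals the lemma.
    Only the simple '+suffix' case is handled in this sketch.
    """
    if not formula:
        return lemma
    if not formula.startswith("+"):
        raise ValueError(f"unsupported formula: {formula!r}")
    return lemma + formula[1:]
```

For example, inflect("abandon", "+ed") yields abandoned, and inflect("abandon", "+ing") yields abandoning.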

The HP Natural Language System 
A natural language processing system is a software 
system that takes natural language input (typically spoken 
or typed input), assigns some interpretation to the input,
and perhaps transforms that interpretation into some
action. Examples of natural language processing systems 
are speech recognizers, machine translators, and natural 
language understanding systems. 

HP's natural language understanding system, HP-NL, 
accepts typed English input and performs morphological, 
syntactic, and semantic analysis on the input. If the 
sentence is well-formed, HP-NL assigns a logic representation
to the sentence. HP-NL can then translate this logic
expression into another language, for instance a database 
query language such as SQL. 

The linguistic coverage of HP-NL is limited by, among
other factors, the size of its lexicon, or its word inventory.
To increase the size of the lexicon and therefore the
coverage of the software system, and to demonstrate that 
electronic dictionaries can be used to solve problems of 
computation, we integrated the CELEX lexical database 
into the HP-NL natural language understanding system.

[Table: entries 6 through 12 of the CELEX dictionary (spellings abandon, abandon, abandoned, abandoned, abandoning, abandonment, abandons), with the columns entry, lemma, spelling, syllabification, morphology, frequency, flection type, and morphological information.]

Fig. 2. Examples of data from the CELEX electronic dictionary.



Natural Language Processing Technology 

Current linguistic technology relies on detailed linguistic
specification of words. Linguistic analysis is based on
basic information about words, the building blocks of
phrases. We will restrict our consideration to two central 
kinds of linguistic information: morphological and syntac- 
tic information. 

Morphological information specifies which component
parts of a word (morphemes) may occur together to
constitute a word. Morphemes may show up on a restricted
class of words. For instance, the prefix un- may
appear on adjectives, and the suffix -s may occur on
third-person verbs:

1a. happy      (adjective)
 b. un+happy   (adjective)
 c. work       (verb)
 d. * un+work

2a. work       (base verb)
 b. work+s     (present third-person-singular verb)
 c. happy      (adjective)
 d. * happy+s
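
The restrictions in examples 1 and 2 can be expressed as a small lookup. The affix table and the two-word lexicon below are assumptions chosen to mirror the examples; a real morphological component would need far more detail.

```python
# Hypothetical affix table: affix -> word classes it may attach to.
AFFIX_CLASSES = {
    "un-": {"adjective"},
    "-s": {"verb"},
}

# Hypothetical mini-lexicon: spelling -> word class.
LEXICON = {"happy": "adjective", "work": "verb"}

def attach(affix, word):
    """Return the affixed form, or None when the combination is ruled out."""
    word_class = LEXICON.get(word)
    if word_class not in AFFIX_CLASSES.get(affix, set()):
        return None
    if affix.endswith("-"):       # a prefix, written "un-"
        return affix[:-1] + word
    return word + affix[1:]       # a suffix, written "-s"
```

Under these assumptions, attach("un-", "happy") and attach("-s", "work") succeed, while the starred forms in 1d and 2d are rejected.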



Syntactic information specifies how words interact with 
other words to form phrases. Two important kinds of 
syntactic information are part of speech and subcategorization.
Part of speech is the major word category to
which the word belongs (for example, noun, verb,
adjective, and so on). Part of speech is important in
determining which words may occur together and where
they may occur in a sentence. For instance, a verb does
not typically occur as the first element in a declarative
English sentence:

3a. She finished repairing the broken toy.
 b. * Finished she repairing the broken toy.
 c. * Finished repairing the broken toy.

Subcategorization indicates more specifically than part of
speech which words or phrases may occur with the word
in question. It specifies how many and which arguments
may occur with a word. Devour must have a noun phrase
following it in a sentence (devour subcategorizes for a
postverbal noun phrase), whereas eat need not:

4a. The tiger devoured its kill.
 b. * The tiger devoured.
 c. The tiger ate its kill.
 d. The tiger ate.
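
Example 4 can be checked mechanically once each verb records whether its object is obligatory or optional. The frame table below is a hypothetical encoding used only for illustration; it is not HP-NL's representation.

```python
# Hypothetical subcategorization frames for the verbs in example 4.
FRAMES = {
    "devoured": {"object": "obligatory"},
    "ate": {"object": "optional"},
}

def is_well_formed(verb, has_object):
    """Check a subject-verb-(object) clause against the verb's frame."""
    requirement = FRAMES[verb]["object"]
    if requirement == "obligatory":
        return has_object
    if requirement == "optional":
        return True
    return not has_object  # any other value: no object allowed
```

With these frames, only 4b (devoured with no object) is rejected.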

Subcategorization also allows us to determine which 
verbs may occur where in verbal clusters: 

5a. They may have left the party already.
 b. * They may left the party already.
 c. * They may have could left the party already.

Words must be specified in sufficient detail that the 
natural language processing system can draw distinctions 
such as those indicated above. 

HP-NL's Lexicon

The grammatical theory behind the HP-NL system is
HPSG (head-driven phrase structure grammar). In this
theory, as in most other modern linguistic theories, full
specification of linguistic information at the word level is
essential. 

Many words have a great deal of linguistic information in
common. For instance, in example 4 above, each verb
subcategorizes for the same kind of subject and object,
but the object is obligatory in the case of devour, and
optional in the case of eat. In example 2 above, we see
that English present third-person-singular main verbs end
with -s. HP-NL captures these and other linguistic similarities
of words through a system of hierarchical word
classification.

HP-NL's lexicon consists of word entries and word classes
arranged in a tree hierarchy (the word class hierarchy).
Each nonleaf node in the word class hierarchy is a word
class. A word class defines a collection of words that
share some cluster of linguistically relevant properties,
which are predictive or descriptive of the linguistic
behavior of the words. The words may be similar morphologically
and/or syntactically, and may have similar
subcategorization. A word class partitions the lexicon into
two sets of words: those that belong to the word class
and those that do not. The characteristics of a word class
are defined by the characteristics that its members share.

58 June 1992 Hewlett-Packard Journal

© Copr. 1949-1998 Hewlett-Packard Co.

[Figure: excerpt from the word class hierarchy showing the word classes WORD, SUBCAT, UNSATURATED, VERB, INTRANSITIVE (parents: unsaturated), and BASE (parents: verb; features: form bse), and the word entry WORK-1 (parents: Base, Intransitive; spellings: "work").]

Fig. 3. An example of a word entry (subcategorization section for
the word work) excerpted from a word class hierarchy.

Word classes are more general closer to the root of the
word class hierarchy, and are more specific closer to the
leaves. Word entries are the leaves of the word class
hierarchy. The word entry itself contains spelling, word
class membership (parent), and idiosyncratic linguistic
information (see Fig. 3).

The complete linguistic specification of a word is established
through instantiation (Fig. 4). In this, the linguistic
information of the word is unified with the information of
all of its parent word classes. Any possible conflicting
information is resolved in favor of the more specific 
information. 
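
Instantiation can be approximated as a merge of feature tables from the most general word class down to the word entry, with later (more specific) values overriding earlier ones. The sketch below makes several assumptions: the class names and feature values are loosely modeled on Fig. 3 rather than copied from HP-NL, a plain dictionary merge stands in for true feature-structure unification, and when two parents disagree the later-listed parent wins.

```python
# Hypothetical word class hierarchy, loosely modeled on Fig. 3.
WORD_CLASSES = {
    "word": {"parents": [], "features": {"lex": "plus"}},
    "verb": {"parents": ["word"], "features": {"maj": "v"}},
    "base": {"parents": ["verb"], "features": {"form": "bse"}},
    "unsaturated": {"parents": ["word"], "features": {"subcat": "plus"}},
    "intransitive": {"parents": ["unsaturated"], "features": {}},
}

# Hypothetical word entry: a leaf of the hierarchy.
WORK_1 = {"parents": ["base", "intransitive"], "spellings": ["work"], "features": {}}

def instantiate(entry, classes):
    """Collect features from all ancestors, then from the entry itself."""
    merged = {}
    for parent in entry["parents"]:
        merged.update(instantiate(classes[parent], classes))
    merged.update(entry["features"])  # more specific information wins
    return merged
```

Instantiating WORK_1 collects lex, maj, form, and subcat from its ancestors, so the entry itself needs to state only its spelling, its parents, and anything idiosyncratic.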

Lexicon Development 

The development of this extensive word classification
system, the word class hierarchy, makes the creation of
HP-NL lexical entries fairly easy. Lexical development
consists of determining the word class to which a word
belongs, identifying the word's idiosyncratic behavior
with respect to that word class, and recording the
spelling and idiosyncratic linguistic behavior of the word.

Even with this powerful lexical tool, creating lexical
entries is time-consuming. First, each word to be entered
into the lexicon must be analyzed linguistically for word
class membership and idiosyncratic behavior with respect
to that word class. And second, it is difficult for the
lexicon developer to know which words a user of a
natural language processing system will want to use, and
which should therefore be entered into the lexicon. Of
course, lexical tools such as desktop dictionaries and 
frequency listings can help the lexicon developer, but 
nonautomatic lexicon development is still work-intensive. 

Because of this, a hand-built lexicon must be small or
labor-expensive. And because the linguistic coverage of a
natural language processing system is limited by the size
of its lexicon, the narrow coverage resulting from the
small lexicon could result in failure of the natural lan- 
guage processing system caused solely by unrecognized 
vocabulary. Natural language processing systems used as 
computer interfaces are intended to allow the user 
maximum freedom in expression. 

To address the problems of identifying the most common
words of English and specifying their linguistic behavior,
HP-NL's lexicon was augmented with dictionary data
obtained from the CELEX electronic dictionary. The
CELEX electronic dictionary was chosen for three reasons.
First, the linguistic classification system is compatible
with modern linguistic theory. Second, the data is
fully accessible. And third, the CELEX electronic dictio- 
nary provides the frequency data needed to identify 
common words. 

Lexical Extension Using the CELEX Dictionary

Several advantages were expected from using the lexical
information in the CELEX electronic dictionary. First, the
primary objective in this effort was to increase the
linguistic coverage of HP-NL by increasing the size of the
HP-NL lexicon with externally compiled dictionary data
from the CELEX electronic dictionary. Until CELEX was
integrated into HP-NL, HP-NL's basic lexicon contained
approximately 700 root words (about 1500 words in all,
including those derived from the root words by lexical
rule). Because the only sentences that can be parsed are
those whose component words have been specified
explicitly in the dictionary, HP-NL's rather small dictionary
also necessarily meant narrow coverage. The CELEX
dictionary has both more words than HP-NL (the CELEX
dictionary has 30,000 lemmas and 180,000 wordforms) and
more different word uses for many of the words.

Fig. 4. An example of an instantiated word entry.

LEX-RULE-PLURAL
old-parent = singular
new-parent = plural
spellings = suffix -s

Fig. 5. An excerpt from the word class hierarchy for the Singular
and Plural word classes, with lexical rule.

Second, because the developers of the CELEX dictionary
drew on learning dictionaries of English (which attempt
to cover core vocabulary) for their lexical data, the CELEX
dictionary represents the most common words in English.
Assuming that natural language processing users will
prefer common rather than unusual words, using the
CELEX data should eliminate the need on the part of
HP-NL lexicon developers to guess at the words commonly
used by natural language processing users.

Finally, we believed that buying a license to use the 
CELEX dictionary would be cheaper than creating a large 
lexicon ourselves. 

All of these expectations were met. The CELEX dictionary
was clearly the best choice among the candidate
electronic dictionaries. The data is accessible,
unambiguously categorized, and extensive; it recognizes many
lexical classes of interest to linguists, and the fee for
commercial clients is reasonable. None of the other
candidate dictionaries had all of these qualities.

Procedure 

In the work reported here, the orthography, language 
variation, phonology, inflectional morphology, and syntax 
data from the English CELEX database was used. 

To integrate a large portion of the CELEX dictionary into
the HP-NL dictionary, we transduced CELEX spelling,
syntactic, and morphological information into a form
compatible with the HP-NL system by mapping the
CELEX dictionary's word classifications onto the (often
more detailed) word classes of HP-NL's lexical hierarchy.

Several mappings between the CELEX dictionary's word
classification scheme and HP-NL's word classes are
straightforward. For instance, the CELEX dictionary's two
classes called count nouns (C_N) and uncount nouns
(Unc_N) correspond to HP-NL's three classes Singular, Plural,
and Mass nouns. A count noun can be pluralized and takes
a singular verb in its singular form and a plural verb in its
plural form. Examples from the CELEX dictionary are
almond(s), bookworm(s), and chum(s).

6a. Her chum was waiting for her at the corner.
b. Her chums were playing tag when the cat got stuck
in the tree.

The two word classes in HP-NL that together correspond
to the C_N word class are the Singular word class and the
Plural word class, which are related by the plural lexical
rule (Fig. 5).

The CELEX uncount nouns are those nouns that occur
only in the singular form with a singular verb. This
includes mass and abstract nouns. Examples of Unc_N
nouns from the CELEX dictionary are bread, cardboard,
and integrity.

7a. How much cardboard is in that box?
b. * How many cardboards are in that box?

8a. The ruler has great integrity.
b. * The ruler has great integrities.

The corresponding word class in HP-NL is the Mass word
class (Fig. 6).

Some nouns can be used as either count or uncount 
nouns and are classified as both C_N and Unc_N in the 
CELEX dictionary. Examples are cake, hair, and cable: 

9a. How much cake would you like?
b. How many cakes are on the table?

These are classified as both Singular (and therefore also
derived Plural) and Mass in HP-NL.

This portion of the mapping between the CELEX dictionary
and HP-NL is simple:

C_N ↔ Singular (and by derivation, Plural)
Unc_N ↔ Mass

This shows an apparent one-to-one mapping. However, 
some of the remainder of the CELEX dictionary nouns 
also map onto the Singular word class. 

Sing_N for Nouns: Singular Use
Plu_N for Nouns: Plural Use
GrC_N for Nouns: Group Countable
GrUnc_N for Nouns: Group Uncountable

The Sing_N and GrUnc_N classes both map onto the Singular-only
HP-NL word class, and therefore these words have no
related Plural form. The GrC_N class maps onto the Singular
word class, so these words have a related Plural form.
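
The noun-class correspondences described so far can be summarized as a small lookup table. The following is only an illustrative sketch: the names CELEX_TO_HPNL, DERIVES_PLURAL, and hpnl_classes are hypothetical and are not part of HP-NL or CELEX, and the Plu_N class is deliberately left out because the text does not state its mapping.

```python
# Mapping of CELEX noun classes onto HP-NL word classes, as described in
# the text. The Python names here are illustrative, not HP-NL's own.
CELEX_TO_HPNL = {
    "C_N": ["Singular"],           # count noun; Plural derived by lexical rule
    "Unc_N": ["Mass"],             # uncount noun
    "Sing_N": ["Singular-only"],   # no related Plural form
    "GrUnc_N": ["Singular-only"],  # no related Plural form
    "GrC_N": ["Singular"],         # group countable; Plural derived by rule
}

# HP-NL classes whose members also receive a Plural entry through the
# plural lexical rule.
DERIVES_PLURAL = {"Singular"}


def hpnl_classes(celex_classes):
    """Return the sorted HP-NL classes for a word's CELEX noun classes."""
    result = set()
    for celex_class in celex_classes:
        for hpnl_class in CELEX_TO_HPNL[celex_class]:
            result.add(hpnl_class)
            if hpnl_class in DERIVES_PLURAL:
                result.add("Plural")
    return sorted(result)


print(hpnl_classes(["C_N"]))           # a count noun such as "chum"
print(hpnl_classes(["C_N", "Unc_N"]))  # a count/uncount noun such as "cake"
```

A word carrying several CELEX classes simply collects the union of the corresponding HP-NL classes, which mirrors the treatment of cake, hair, and cable above.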

If all of the many-to-one mappings had been multiple
CELEX word classes collapsing down to one HP-NL class,
the transduction would have been perhaps difficult to
untangle, but possible. Instead, as we will see in the next
section, in some cases HP-NL recognizes more word
classes than the CELEX dictionary. This means that
information that HP-NL needs for accurate parsing is not
provided by the CELEX dictionary.

Difficulties 

While using the CELEX dictionary did pay off in the 
expected ways, there were some difficulties that the user 
of electronic dictionaries should be aware of. 

Finding the correspondences between the CELEX
dictionary's word classes and HP-NL's word classes was
surprisingly complicated. Part of the problem was sketchy
documentation: some of the word classes were
underdescribed, and in one case, the documentation inaccurately
described the data, switching two columns (this error has
since been corrected). Also, some of the linguistic
distinctions the CELEX dictionary recognizes are orthogonal to
the distinctions HP-NL recognizes. Furthermore, some of
the correspondences between the CELEX dictionary and
HP-NL word classes involved one-to-many mappings in
which the CELEX dictionary recognizes fewer word class
distinctions than HP-NL requires.

Unclear CELEX Classification. The documentation provides a
short one-to-three-word description for each of the
syntactic word categories. In some cases, the description
clearly describes the syntactic behavior. For instance, the
example above demonstrating the mapping between
the CELEX dictionary C_N and Unc_N classes and the
HP-NL Singular, Plural, and Mass word classes shows a case
in which the somewhat slim documentation was adequately
informative.

In the case of verb subcategorization, however, the
documentation is not informative enough. The CELEX
dictionary recognizes eight verb subcategorization classes:
transitive, transitive plus complementation, intransitive,
ditransitive, linking, prepositional, phrasal prepositional,
and expression verbs. Exactly what syntactic behavior
is meant by each of these classes is unclear, however,
although sample words are given in addition to a
description for each word class. The following sample words
occur in the CELEX documentation:

Trans_V: Transitive
    crash: he crashed the car
    admit: he admitted that he was wrong
    not cycle: * he cycled the bike
TransComp_V: Transitive plus Complementation
    found: the jury found him guilty
    make: they had made him chairman
Intrans_V: Intransitive
    alight: he got the bus and alighted at the City Hall
    leave: she left a will
        she left at ten o'clock
    not modify: * he modified
Ditrans_V: Ditransitive
    envy: he envied his colleague
    tell: she told him she would keep in touch
Link_V: Linking Verb
    be: I am a doctor
    look: she looks worried
Phrasal: Phrasal Verb
    Prepositional
        to
        consist of
    Phrasal prepositional
        walk away with
        cry out against
    Expression
        toe the line
        bell the cat

In English, there is a group of verbs that occur with two
arguments. Sometimes the second postverbal argument
must be a noun phrase, sometimes it must be a to
prepositional phrase, and sometimes it may be either. Consider
the following uses of give, explain, and begrudge:

10a. The girl gave a book to her younger sister.
b. The girl gave her younger sister a book.

c. The girl explained the story to her sister.
d. * The girl explained her sister the story.

e. * The girl begrudged her new ball to her sister.
f. The girl begrudged her sister her new ball.

The linguistic behavior of these words is clear. However,
the CELEX documentation does not clarify which class
should contain the use of give in 10a and explain in 10c,
which can accept either a noun phrase (NP) or a pp-to
phrase as the second argument. Furthermore, it is unclear
from the documentation whether the CELEX dictionary
recognizes the two uses of give as being related to each
other.

Inspecting the CELEX data also yields no clear indication
whether the ditransitive class includes verbs with only
two NP arguments like begrudge, with only one NP and
one pp-to argument like explain, or with both like give.
Perhaps all three verb types are ditransitive, perhaps only
those that alternate, and perhaps only those that accept
only two noun phrase complements. Inspecting the
classification of these three words themselves yields little
more insight. Many words have multiple syntactic
behaviors, so that it is difficult to tell exactly which syntactic
behaviors each CELEX word classification is intended to
cover.
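
The ambiguity can be made concrete by writing down the frames from examples 10a through 10f and testing each of the three candidate readings of "ditransitive" against them. This is a minimal sketch under stated assumptions: the frame labels "NP NP" and "NP PP-to" and the names FRAMES, HYPOTHESES, and ditransitive_under are hypothetical, and nothing here reflects how CELEX actually stores its data.

```python
# Frames accepted by each verb, taken from examples 10a-10f.
# "NP NP" = two noun phrases; "NP PP-to" = noun phrase plus to-phrase.
FRAMES = {
    "give": {"NP NP", "NP PP-to"},
    "explain": {"NP PP-to"},
    "begrudge": {"NP NP"},
}

BOTH = {"NP NP", "NP PP-to"}

# Three candidate readings of the CELEX "ditransitive" label.
HYPOTHESES = {
    "all three verb types": lambda frames: bool(frames),
    "only alternating verbs": lambda frames: frames == BOTH,
    "only NP-NP-only verbs": lambda frames: frames == {"NP NP"},
}


def ditransitive_under(hypothesis):
    """List the verbs that count as ditransitive under one reading."""
    accepts = HYPOTHESES[hypothesis]
    return sorted(verb for verb, frames in FRAMES.items() if accepts(frames))


for name in HYPOTHESES:
    print(name, "->", ditransitive_under(name))
```

Each reading selects a different set of verbs, so the class label alone does not determine which frames a transduced entry should receive.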

Following are some other examples that demonstrate the
difficulty of ascertaining the intended meaning of the
CELEX verb classes:

11a. I waited all day for you! 
b. The patriots believed that their government was 
right. 

It is unclear whether wait for is a transitive verb (the
pp-for argument being considered an adjunct), a
ditransitive verb (the pp-for argument being considered a second
complement), or a transitive plus complementation verb
(the pp-for argument being considered a miscellaneous
complement). Similarly, in the case of believe, it is
unclear both from documentation and from inspecting the
data whether the sentential argument of believe causes
this verb to be transitive, transitive plus complementation,
or intransitive.

Orthogonal Classification. The CELEX dictionary recognizes
some categories that are orthogonal to the kinds of
categories that HP-NL recognizes and requires. For
instance, the CELEX dictionary recognizes linking verbs,
which "link a subject I with a complement that describes
that subject a doctor in a sentence like I am a doctor.
These subject complements can take the form of a noun
phrase (She is an intelligent woman), an adjective
phrase (She looks worried), a prepositional phrase (She
lives in Cork), an adverb phrase (How did she end up
there?) or a clause (Her main intention is to move
somewhere else)" (CELEX Users' Guide).

The distinction between linking and nonlinking verbs is a
semantic one. HP-NL's word class hierarchy draws
morphological and syntactic distinctions, but not semantic
ones. So this information must be identified as not being
useful (currently) and discarded.

One-to-Many Mapping. Some of the distinctions the CELEX
dictionary draws are useful but not extensive enough for
HP-NL's purposes. For instance, linguistic theories
distinguish between two types of phrasal verbs: raising and
equi verbs.

Raising:

12a. The student seems to be healthy.
b. There seems to be a healthy student.

Equi:

13a. The student tried to climb the tree.
b. * There tried to climb the tree the student.

The CELEX dictionary does not draw this distinction.
While the verbs seem and try are indeed present in the
database, not all of their syntactic behavior is
documented. To use the CELEX dictionary, some of the data
would have to be augmented from some other source.

One large group of words is underspecified in the CELEX
dictionary with respect to the HP-NL natural language
processing system: the members of the closed word
classes, those classes of grammatical words to which new
words are seldom added. Examples are prepositions such
as of and to, determiners such as the and every, and
auxiliary verbs such as be and could. These grammatical
words carry a great deal of information about the linguistic
characteristics of the phrase in which they appear, and
must therefore be specified in detail for a natural
language processing system.

Outcome 

Despite the difficulties noted here, incorporating the
CELEX dictionary into HP-NL turned out to be not only
profitable for the system but also instructive. The
vocabulary of the HP-NL system was increased from about 700
basic words to about 50,000 basic words.

The addition of the words greatly increased the number
of sentences that could be parsed. This increase, however,
resulted in an overall slowing of parsing, because of
lexical ambiguity. This both slowed word retrieval and
increased the number of possible partial parses.

Only words in the open classes (noun, verb, adjective)
could be added. The HP-NL system requires a lexical
specification that is too theory-specific for the very
important grammatical words, as well as for some
members of the open classes. However, many members of the
open classes could be correctly represented in the HP-NL
format.

The HP-NL project was terminated before user studies
could be conducted that would have determined whether
the CELEX dictionary provides the words a user would
choose while using a natural language processing system.

Computational Applications of Electronic Dictionaries 

This case study, done using a large electronic dictionary,
suggests that electronic lexicographical information can be
incorporated successfully into nondictionary applications.
First, we found that the CELEX data is in such a form
that it can be transformed for and accessed successfully
by a software application. Second, the data in the CELEX
dictionary is useful in the domain of natural language
processing. The areas of success and difficulty in
incorporating the CELEX dictionary into the HP-NL system
should indicate which kinds of software applications
could successfully integrate an electronic dictionary.

The greatest gain from the CELEX dictionary was in
increasing HP-NL's vocabulary dramatically. Although the
vocabulary increase also resulted in slower parsing, the
larger vocabulary was still seen as an improvement,
because the larger vocabulary greatly extended HP-NL's
natural language coverage. For an application that does
not seek to have wide vocabulary coverage, a large
dictionary would clearly not provide the same large
advantage.

Another improvement to HP-NL is in the particular
vocabulary represented. The CELEX dictionary provides
common English words, which are the words HP-NL
needed. An application requiring an unusual vocabulary
(for instance, a vocabulary of technical terms) would not
benefit from the CELEX dictionary as much as did HP-NL.

The largest problem in using the CELEX dictionary was
inadequate information for some word classes. Some of
the documentation was not completely clear, and some
words were not represented in the detail required for
successful parsing by HP-NL. This rendered some of the
CELEX dictionary's information useless for HP-NL. This
did not present a great difficulty; many of the problematic
words had already been created for HP-NL. Of course,
not all applications of dictionary technology will be in
such an advanced linguistic state as HP-NL. An
application of dictionary technology has the most likelihood of
being successful, at least in the near term, if it does not
require very fine categorization of words, particularly
closed-class words.

One topic that was not addressed in the current study is
the role of word meanings in a software application. The
CELEX dictionary contains no definition information.
Therefore, its words have no meaning with respect to a
particular domain such as querying a particular database.
Mapping the meanings of the words from the CELEX
dictionary to a domain must still be done painstakingly by
hand. A thesaurus could potentially provide large groups
of synonyms for words that are defined by hand with
respect to an application. At this point, however, the most
successful application of electronic dictionary technology
would avoid the problem of meaning entirely and use
words themselves as tokens.

In summary, the kind of software application most likely 
to benefit from electronic dictionaries would require a 
large vocabulary of common words. 

Choosing Applications 

Now that we have some idea of the characteristics of the
software applications that might benefit from an
electronic dictionary and what kinds of problems might be
encountered in incorporating the dictionary into the
application, we can consider what particular applications
could use electronic dictionaries.

To review, software systems of the following types could
benefit from the data in an electronic dictionary:

• Any software system that uses natural language in any
way

• A software system that requires a large vocabulary of
common words

• A software system that does not require detailed
linguistic specification of grammatical words

• A software system that does not make use of word
definitions

• A software system that does not require complete
linguistic analysis of words unless it supplies its own
categorization scheme (such as HP-NL).

Some types of software applications that match these
characteristics are natural language processing, speech
generation and recognition, document input, document
management, and information retrieval.

The electronic dictionary in turn should possess the
following characteristics:

• Data accessible to software application (not encrypted)

• Good data organization (so that access is easy and
flexible)

• Appropriate vocabulary (for instance, good coverage of
the core vocabulary of English)

• Appropriate additional information (for instance, modern
linguistic classification for natural language processing
systems).

Of the dictionaries we have surveyed, few satisfy the first
requirement. The CELEX dictionary, LDOCE, and several
word lists have accessible data. These electronic
dictionaries vary in the degree to which they exhibit the other
characteristics.

Natural Language Processing. Natural language processing
systems are the software applications whose need for
electronic dictionaries is most extensive and most
obvious. The information necessary is spelling, morphology,
part of speech, and subcategorization, at least. A more
extensive discussion of the role of electronic dictionaries
in natural language processing systems was presented
earlier in this paper.



Speech Technology. In both speech generation and speech
recognition, a vocabulary list and associated
pronunciations are essential. Depending on the sophistication of
the speech system, other linguistic information may also
be useful.

If the speech generation system is to generate words
alone, a word list with pronunciations is sufficient, but
if it must generate full phrases or sentences spontaneously
rather than from a script, a natural language generation
system is necessary. This generator may be based
on linguistic theory or it may be based instead on
template forms, but in either case, an electronic dictionary
could provide the word classification information
necessary.

Speech recognition systems that recognize one-word or
canned commands also need no more than a word list
with pronunciations. However, if a speech recognition
system must recognize spontaneously created phrases, a
more sophisticated approach to recognition is necessary.
After the word possibilities have been identified, there are
several ways in which the speech recognizer can identify
potential sentences:

• By ruling out ill-formed sentences on the basis of
impossible word-type combinations

• By rating possible sentences on the basis of collocation
information derived from a statistical survey of texts

• By parsing with a natural language understanding system.

Of these, the first and last possibilities would require
word class information in addition to pronunciation
information, which can be gained from electronic
dictionaries currently available. The second possibility would
require data from a statistical study, preferably performed
on texts from the relevant domain.
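
The first of these strategies, ruling out ill-formed sentences by impossible word-type combinations, can be sketched in a few lines. The word classes, the list of impossible adjacent pairs, and the function names below are all hypothetical stand-ins for data that an electronic dictionary and a grammar would supply; this is an illustration of the idea, not a description of any existing recognizer.

```python
from itertools import product

# Hypothetical word classes, standing in for dictionary data.
WORD_CLASS = {
    "the": "Det",
    "dog": "Noun",
    "dug": "Verb",
    "barks": "Verb",
    "parks": "Noun",
}

# Hypothetical pairs of classes that may not be adjacent.
IMPOSSIBLE = {("Det", "Verb"), ("Det", "Det"), ("Verb", "Verb")}


def well_formed(words):
    """True if no adjacent pair of word classes is listed as impossible."""
    classes = [WORD_CLASS[word] for word in words]
    return all(pair not in IMPOSSIBLE for pair in zip(classes, classes[1:]))


def candidate_sentences(slots):
    """Expand per-position word hypotheses and keep the well-formed ones."""
    return [list(words) for words in product(*slots) if well_formed(words)]


# Each slot holds the acoustically plausible words for one position.
slots = [["the"], ["dog", "dug"], ["barks", "parks"]]
for sentence in candidate_sentences(slots):
    print(" ".join(sentence))
```

Here the two hypotheses containing "the dug" are discarded because a determiner directly followed by a verb is listed as impossible; the surviving candidates could then be ranked by collocation data or passed to a parser.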

Document Input. Examples of computer applications that
facilitate document input are optical character recognition
(OCR), "smart" keyboards, and dictation aids. Document
input is error-prone. One of the many ways to reduce
errors is to allow the computer access to linguistic
information.

Such an application would need a word listing, a theory
of the errors likely to be made by the system, and a
theory of the relative frequency of appearance of
well-formed subparts of words. A more sophisticated system
might recognize multiple word blocks, requiring the
linguistic module to provide either word classification
information (for parsing-like ability) or word collocation
information (for statistical information on word
co-occurrence).

The HP-UX* operating system provides a minimal "smart"
keyboard facility in the csh environment, with an escape
feature for completing known commands. This feature
could be expanded for full English, and could include not
only word completion, but also partial word completion.
That is, the application could have some knowledge of
the frequency of substrings in spellings (or its equivalent
in speech), and with this knowledge could reduce the
number of keystrokes necessary for input.

When speech recognition technology advances sufficiently,
the opportunity for dictation aids will arise. These tools
could perform a similar function to smart keyboards, but
in the realm of spoken rather than typed language.
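
Word completion and partial word completion of the kind described for smart keyboards can be illustrated with a word list alone. The sketch below is an assumption-laden example, not the csh mechanism: the function name complete and the sample vocabulary are hypothetical, and it ignores the substring frequency information mentioned above.

```python
def complete(prefix, vocabulary):
    """Extend `prefix` as far as all matching vocabulary words agree.

    Returns the full word when only one word matches (word completion),
    the longest shared extension when several match (partial word
    completion), or None when no word starts with the prefix.
    """
    matches = [word for word in vocabulary if word.startswith(prefix)]
    if not matches:
        return None
    # The common prefix of a set of strings equals the common prefix of
    # its lexicographically smallest and largest members.
    low, high = min(matches), max(matches)
    i = 0
    while i < len(low) and i < len(high) and low[i] == high[i]:
        i += 1
    return low[:i]


vocabulary = ["dictionary", "dictation", "dictate", "document"]
print(complete("do", vocabulary))     # one match: the whole word
print(complete("di", vocabulary))     # several matches: shared extension
print(complete("dicta", vocabulary))
```

Typing "di" and requesting completion would yield "dict", saving two keystrokes before the user has to disambiguate.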

Optical character recognition is one of the most
promising areas in which electronic dictionaries could be used.
At least one language-based assistance product for OCR is
available commercially: OmniSpell, a spelling corrector for
use with the OCR product OmniPage. It suggests likely
alternate spellings for strings not recognized as words.
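
The general idea of suggesting alternate spellings for unrecognized strings can be sketched with Python's standard difflib module. This is only an illustration of dictionary-based correction; it makes no claim about how OmniSpell works, and the function name suggest, the sample word list, and the 0.6 similarity cutoff are assumptions.

```python
import difflib


def suggest(token, word_list, n=3):
    """Return the token itself if it is a known word; otherwise return up
    to `n` similar words from the list, best match first."""
    if token in word_list:
        return [token]
    # get_close_matches ranks candidates by a similarity ratio and drops
    # those scoring below the cutoff.
    return difflib.get_close_matches(token, word_list, n=n, cutoff=0.6)


words = ["dictionary", "dictation", "document", "lexicon"]
print(suggest("dictionarv", words))  # "y" misread as "v"
print(suggest("lexicon", words))
```

A real OCR aid would also weight candidates by the character confusions the recognizer is known to make, which corresponds to the theory of likely errors mentioned above.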

Document Management. Linguistic information can also aid 
in checking and improving the quality of a document 
stored on a computer. Spelling checkers, based on com- 
mon typographical errors and variation of the misspelling 
from well-formed words, as well as on phonological 
characteristics of the misspelled word, are available 
already. 

Grammar and style checkers, however, are not available
in as great abundance or variety. There are grammar
checkers available such as the Grammatik spelling
checker, but they focus primarily on form statistics (average
word length in a sentence, average syllable length of
words) and on frozen style characteristics (use of
idiomatic expressions and cliches, use of passive). They are
notably not very good at identifying errors in grammar,
such as lack of subject-verb concord with complex
subjects,

14. *The teacher but not the students are happy with the
football team.

choice of correct pronoun case in complex prepositional
objects,

15. *Between John and I, he works more,

and similar subtle points in grammar.†

Natural language parsing technology could improve the
performance of grammar checkers, and an electronic
dictionary would be an important part of such a natural
language processing system. Otherwise, an electronic
dictionary indicating part of speech and other relevant
grammatical information such as verb conjugation class,
noun phrase number, and pronoun case could be useful in
heuristic inspection of sentences.

Information Retrieval. Information retrieval capability could
be expanded by incorporating a theory of related words
into information retrieval systems. While this expansion
may not be necessary for standard databases in which
words have a formal meaning and are not really natural
language items, it could be very useful in full-text
databases. There are two kinds of information that could
expand retrieval possibilities.



† One well-known grammar checker incorrectly identified both of these sentences as being
grammatical.



First, a user might search on a keyword but be interested
in retrieving all occurrences of that word in related
morphological forms:

16a. factory / factories 

b. goose / geese 

c. happy / happiness 

A morphological analyzer module that can recognize
morphologically related words, either through exhaustive
listing or through some theory of morphological variants,
might expand retrieval possibilities.
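
The exhaustive-listing approach can be sketched with a table of related forms taken from example 16. The table contents beyond those three pairs, the names FORMS, LEMMA_OF, related_forms, and search, and the whitespace tokenization are all illustrative assumptions; a production system would draw the table from dictionary data.

```python
# Hypothetical table of morphologically related forms (example 16),
# standing in for data from an electronic dictionary.
FORMS = {
    "factory": ["factory", "factories"],
    "goose": ["goose", "geese"],
    "happy": ["happy", "happiness"],
}

# Inverted index from every listed form back to its headword.
LEMMA_OF = {form: lemma for lemma, forms in FORMS.items() for form in forms}


def related_forms(keyword):
    """Return all listed forms related to the keyword, or just the keyword."""
    lemma = LEMMA_OF.get(keyword)
    return FORMS[lemma] if lemma else [keyword]


def search(keyword, documents):
    """Return documents containing the keyword or any related form."""
    targets = set(related_forms(keyword))
    return [doc for doc in documents if targets & set(doc.lower().split())]


documents = ["Two geese crossed the road", "The factory closed", "A goose honked"]
print(search("goose", documents))
print(search("factories", documents))
```

Searching on either member of a pair retrieves documents containing the other, which a plain string match would miss for irregular forms such as goose and geese.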

Second, a user might be interested in retrieving all
information on a particular topic, but that topic might be
identified by several different synonyms. For instance, the
user might want to retrieve all mentions of animal in a
text. A thesaurus would permit the user to retrieve
mentions of creature and beast, and perhaps also subtopics
such as mammal, amphibian, and reptile.
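
Thesaurus-based query expansion for the animal example might look like the following. The THESAURUS structure, its two relation names, and the expand_query function are hypothetical; they only illustrate how synonyms and subtopics could be added to a search term.

```python
# Hypothetical thesaurus entry built from the example in the text.
THESAURUS = {
    "animal": {
        "synonyms": ["creature", "beast"],
        "subtopics": ["mammal", "amphibian", "reptile"],
    },
}


def expand_query(term, include_subtopics=False):
    """Return the term plus its synonyms, and optionally its subtopics."""
    entry = THESAURUS.get(term, {})
    terms = [term] + entry.get("synonyms", [])
    if include_subtopics:
        terms = terms + entry.get("subtopics", [])
    return terms


print(expand_query("animal"))
print(expand_query("animal", include_subtopics=True))
```

Keeping subtopics optional lets the user choose between a narrow synonym search and a broader topical search.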

Conclusion 

Electronic dictionaries have recently reached a state of
development which makes them appropriate for use on
the one hand as machine-readable end-user products, and
on the other hand as components of larger language-based
software systems. There are several domains of
software applications that either are already benefitting
from electronic dictionaries or could benefit from
electronic dictionaries that are available now. One project at
Hewlett-Packard Laboratories has successfully integrated
one electronic dictionary, the CELEX lexical database,
into its natural language processing system. Other
software applications that could use the extensive
information available in electronic dictionaries are speech
generation and recognition, document input such as optical
character recognition and "smart" keyboards, document
management such as spelling and grammar checking, and
information retrieval.




Diana C. Roberts

Diana Roberts received her BS degree in psychology
from the Georgia Institute of Technology in 1982 and her
MA degree in linguistics from Cornell University in
1991. She joined HP Laboratories in 1985 as a member
of the technical staff and did



Application of Spatial Frequency 
Methods to Evaluation of Printed 
Images 

Contrast transfer function methods, applied in pairwise comparisons,
differentiated between print algorithms, dot sizes, stroke widths,
resolutions (dpi), smoothing algorithms, and toners. Machine judgments
based on these methods agreed with the print quality judgments of a
panel of trained human observers.

by Dale D. Russell



Certain aspects of printed images lend themselves to
analysis by spatial frequency methods. The ultimate goal
of this type of analysis is to derive a single figure of
merit from a test pattern that is sensitive to the overall
performance of the printer system. The printer system
includes the firmware, hardware, and software, as well
as the print engine with its associated colorant/paper
interaction.

The value of the modulation transfer function (MTF) in
defining optical systems has been demonstrated for
decades. As early as 1966, photographic resolving power
was shown to be an inadequate measure of system
performance and the MTF has been increasingly used.
Similarly, the resolution of a printer in terms of dots per
inch (dpi) is not adequate to describe the performance
and fidelity of the printer through the whole range of
spatial frequencies that must be rendered. A consideration
of resolution alone fails to take into account either the
lower-frequency fidelity or the limiting effect of the
human eye.

The MTF generates a curve indicating the degree to 
which image contrast is reduced as spatial frequency is 
increased. Unlike resolution, MTF gives a system re- 
sponse with values from zero to a finite number of cycles 
per millimeter, thus filling in information about the low 
and middle ranges of the spatial frequency spectrum. 

Strictly speaking, continuous methods such as the MTF
and the contrast transfer function (CTF) do not apply to
discrete systems such as a digital printer, and applications
of these functions to discrete systems typically meet with
mixed success. The MTF and CTF assume a system that
is linear and space and time invariant. Any system with
fixed sampling locations (such as a 300-dpi grid) is not
space invariant, and sampling theory must be judiciously
applied to characterize it. Printers not only digitize data,
but most printers binarize it as well, making interpolations
of values impossible. This introduces what is
essentially a large noise component and gives rise to
moire patterns on periodic data.



On the other hand, spatial frequency methods offer a
great advantage in that the transfer functions for
individual components of a system can be cascaded (i.e.,
multiplied together) to yield the transfer function of the
system (with some exceptions). Provided that a system is
close to linear, as it must be if the printed image is to
look anything like the intended image, then multiplying
component MTFs point by point adequately predicts a
complete system MTF.1 If MTF methods can be adapted
to discrete systems, then the overall transfer function will
exhibit a sensitivity to changes in the transfer functions
of all system components. This sensitivity can be
exploited, for example, to diagnose problems or to evaluate
new printer designs.
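
Cascading can be illustrated with a short sketch that multiplies component MTFs point by point. The component names and sample values below are invented for illustration and are not measurements from this study; the function name cascade is likewise hypothetical, and the sketch assumes every component is sampled at the same spatial frequencies.

```python
import math


def cascade(*component_mtfs):
    """Multiply component MTFs point by point to estimate the system MTF.

    Each argument is a sequence of MTF values sampled at the same
    spatial frequencies.
    """
    if len({len(mtf) for mtf in component_mtfs}) != 1:
        raise ValueError("all components must share the same frequency samples")
    return [math.prod(values) for values in zip(*component_mtfs)]


# Invented sample values at four increasing spatial frequencies.
engine_mtf = [1.0, 0.9, 0.6, 0.2]
paper_mtf = [1.0, 0.8, 0.5, 0.5]

system_mtf = cascade(engine_mtf, paper_mtf)
print([round(value, 3) for value in system_mtf])
```

Because the system response is a product, a drop in any one component at a given frequency lowers the system response at that frequency, which is the sensitivity the text proposes to exploit for diagnosis.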

The modulation transfer function is the modulus of the
system optical transfer function, which is the Fourier
transform of the system point-spread function. While the
MTF of a component or system is easier to calculate,
experimental work is generally based on measurement of
the CTF. This function is then compared to the theoretical
performance of a system to yield a single figure of merit
for that system. A printer commanded to print a 50% fill
pattern consisting of lines will reach a limit of spatial
frequency at which it must overprint the designated area.
This results in increasing average optical density with
increasing spatial frequency, as observed in the test
patterns. The CTF is based on contrast, or the difference
in reflectance of the printed and unprinted portions of the
test pattern. As the white space is increasingly
encroached upon by the printed area, or increasingly filled
with spray and scatter, the contrast is degraded. This
results in a loss of print fidelity and a concomitant
decrease in the value of the CTF at that frequency. In the
limit, contrast and CTF drop to zero.

In addition to printer limitations, the human eye, with
discrete receptors, has a spatial frequency limit of
sensitivity. This cutoff point sets a practical limit on the need
for improved contrast in a printed image. Furthermore,
the contrast sensitivity curve for the human eye, when
considered as part of the total system, can be convolved
with the CTF curve for the printer to assess the relative
importance of improvements in contrast as perceived by
the human observer.

Integrating under the CTF curve through the pertinent
frequency range gives a single value of the function for
the system. This figure can be compared with a standard
that represents the theoretical performance to obtain a
figure of merit for the system. When the CTF-derived
figure of merit correlates with one or more parameters as
evaluated by the human observer, then the additional
advantage of being able to predict human response with a
machine-graded test is realized.
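
The integration step can be sketched numerically. The example below assumes trapezoidal integration and assumes that the comparison with the standard is a simple ratio of areas; the text does not specify either choice. The function names and the sample CTF values are hypothetical.

```python
def area_under(frequencies, values):
    """Trapezoidal integral of a sampled curve over its frequency range."""
    area = 0.0
    for i in range(1, len(frequencies)):
        width = frequencies[i] - frequencies[i - 1]
        area += 0.5 * (values[i] + values[i - 1]) * width
    return area


def figure_of_merit(frequencies, measured_ctf, reference_ctf):
    """Ratio of the measured CTF area to the reference CTF area
    (the ratio form is an assumption made for this sketch)."""
    return area_under(frequencies, measured_ctf) / area_under(frequencies, reference_ctf)


# Invented sample data: spatial frequency in cycles per millimeter.
frequencies = [0.0, 1.0, 2.0, 3.0, 4.0]
measured = [1.0, 0.8, 0.5, 0.2, 0.0]
ideal = [1.0, 1.0, 1.0, 1.0, 1.0]

print(round(figure_of_merit(frequencies, measured, ideal), 3))
```

With these invented values the measured curve encloses half the area of the ideal curve, so the figure of merit is about 0.5; a printer that held contrast to higher frequencies would score closer to 1.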

Fourier transform methods are closely related to the MTF 
and are also applicable to print quality evaluation. Where- 
as the CTF experiment shows printer performance using 
a particular test pattern, the Fourier transform can in 
principle be applied to any printed image. The Fourier 
transform in this case takes intensity-versus-position data 
and transforms it to the spatial frequency domain. When 
a Fourier transform for an image is compared to the 
Fourier transform for a "standard" image, the dropout 
areas reveal which frequencies are not transferred by that
printer. This can be used to determine limits of printer
performance and to identify sources of change in printer 
performance. 

A number of advantages are associated with the use of 
the Fourier transform for image quality analysis. First, the
freedom to select virtually any character or image means
that exactly the same image can be presented to both the
human and machine observers. Greater control over the
experimental variables is possible, since the very same
page is evaluated, rather than a test target for the machine
vision system and a page of text or graphics for the
human. Printing two different patterns on two separate
pages at two different times necessarily introduces
uncontrolled and even unknown variables that may
influence print quality measurements. Use of the Fourier
transform for this analysis can eliminate at least some of 
these variables. 

With fast Fourier transform algorithms available today, the 
time to transform an entire frame of image data is only a
minute or so. This makes the time required for analysis
by this method considerably less than that for a complete
workup of the spatial frequency sweep test target. Given
the freedom to select any image for the Fourier trans- 
form, attention can be focused on the most egregious 
visible defects in printer performance. This should further 
reduce the time required for analysis and ultimately for 
printer system improvements. 

This paper discusses the development and application of 
various test patterns to black-and-white print quality 
evaluation with extension to color print quality evaluation.
A trained panel of judges evaluated merged text and
graphics samples, and their responses are compared with
the results of the CTF method. In addition, some examples
of the Fourier transform evaluation of printed images
are given, and are compared to the information from the
CTF method.



Experimental Methods

Three different test patterns were used to derive contrast 
transfer function data for the printer systems being 
evaluated. The simplest pattern consists of printed lines 
paired with unprinted spaces of equal width, in a sequence
of increasing spatial frequency. The advantage of
this pattern is its simplicity. The principal disadvantage is
that it provides information on the printer system only in
one axis. Evaluation of the contrast is done using image
processing software to generate a line profile normal to
the printed lines. Contrast is determined as indicated
below.

The second pattern consists of 50% hatched fill patterns
at five selected spatial frequencies (0.85 to 2.0 cycles/mm)
chosen to lie within the range of human sensitivity (0 to
approximately 4.7 cycles/degree at a viewing distance of 12 in) and to
contain the majority of the spatial frequency information
for text. Each is presented at seven different angles from
the horizontal axis to the vertical axis of the page. This
pattern was analyzed by measuring the average optical
density and comparing it with the computed theoretical
optical density of a 50% pattern, given the paper and
colorant used. Patterns of this type are commercially
available in print quality analysis packages.

The most complex pattern evaluated consists of concen- 
tric rings increasing in spatial frequency from the center 
out. This pattern includes all print angles relative to the 
page, but is sensitive to the algorithm used to generate 
the circle. It provides striking evidence of the impact of
the software on the final rendering of a printed sample.
In terms of CTF, it reveals the very large contribution of
the software to the overall transfer function. 

For uniformly sampled spaces, as in a scanner or printer, 
a circular spatial frequency test pattern gives a visual
representation of the system response at all print angles.
One effect of the continuously varied spatial frequency
mapped onto a fixed grid is the appearance of moire
patterns, which indicate beat frequencies between the
grid and the desired print frequency.4 The moire patterns
are observed along the pattern axes at frequencies
corresponding to division of the printer's capability in
dots-per-inch by integer and half-integer numbers. While
normally used to test the physical reproduction capability
of a grid and spot point function, the circular spatial
frequency test pattern also proves useful for evaluating
the rendering print quality of graphics algorithms. 

Fig. 1. Concentric circle test pattern, 2400-dpi Linotronic imagesetter output.

The concentric circle test pattern is shown in Fig. 1,
which is a 2400-dpi Linotronic imagesetter output described
under the heading "Effect of Print Algorithm"
below. The white dots on the horizontal axis are frequency
markers with a spacing of OJ cycles/mm. There are
similar markings in the vertical axis with a spacing of 10
cycles/inch. The viewing distance determines the spatial
frequency on the human retina in cycles/degree, which is
the appropriate unit for dealing with human contrast
sensitivity. To convert from cycles/degree to cycles/mm,
the following relationship is used:

$$\nu = \frac{57.3\,\mu}{H}$$

where ν is the spatial frequency in cycles/mm, μ is the
spatial frequency in cycles/degree, and H is the viewing
distance in mm.5
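
As a quick check on the units, the conversion between cycles/degree and cycles/mm can be written as a pair of small functions. This is a sketch under the assumption that the relationship is frequency in cycles/mm = 57.3 × frequency in cycles/degree ÷ viewing distance in mm, where 57.3 is the number of degrees in a radian; the function names are hypothetical.

```python
DEGREES_PER_RADIAN = 57.3  # rounded constant used in the relationship above


def cycles_per_mm(cycles_per_degree, viewing_distance_mm):
    """Convert a retinal spatial frequency to a frequency on the page."""
    return DEGREES_PER_RADIAN * cycles_per_degree / viewing_distance_mm


def cycles_per_degree(cycles_mm, viewing_distance_mm):
    """Inverse conversion: page frequency to retinal frequency."""
    return cycles_mm * viewing_distance_mm / DEGREES_PER_RADIAN


# Example: a 12-inch (304.8 mm) viewing distance.
print(cycles_per_mm(25.0, 304.8))
```

Moving the page farther away lowers the page frequency that corresponds to a given retinal frequency, which is why the viewing distance has to be fixed before comparing results with human contrast sensitivity.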

Five different black-and-white electrographic printers,
with firmware and toner variations of each, were evaluated.
Two color printers, electrographic and thermal
transfer, were also compared.

The color and black patterns were analyzed using a CCD 
(charge-coupled device) color video camera vision system 
and commercially available control software. Digital image 
processing was performed with commercially available 
and in-house software. Care was taken to control such
experimental variables as the distance from the camera to 
the paper, the output power of the lamp, the angles of 
viewing and illumination, and the magnification of the 
image. Measurements were made in a thermostatically 
controlled room so that detector noise and dark current 
would be minimal and relatively constant. Every effort
was made to eliminate stray light.

Optical densities of the printed lines and unprinted spaces
were determined along with the spatial frequencies. This
was done by evaluating a line profile taken normal to the
printed line. A contrast function, C, was computed for
each line-space pair according to the formula:5



$$C = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}}$$

where $I_{\max}$ and $I_{\min}$ are the reflected light intensities of
the space and line, respectively, as measured by the video
camera for the line profile data. These values had a range
of 0 (no measurable reflected light) to 255 (maximum
measurable light intensity).
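
A minimal sketch of the contrast calculation for one line-space pair follows. It assumes the contrast function is the usual modulation ratio, (I_max − I_min) / (I_max + I_min), applied to 8-bit camera intensities; the function names and the sample profile are hypothetical.

```python
def contrast(i_max, i_min):
    """Modulation contrast from the space (i_max) and line (i_min) intensities."""
    total = i_max + i_min
    if total == 0:
        return 0.0  # no measurable light in either region
    return (i_max - i_min) / total


def percent_modulation(profile_segment):
    """Contrast of one line-space pair, taken from a slice of a line profile."""
    return 100.0 * contrast(max(profile_segment), min(profile_segment))


# Illustrative 8-bit intensities across one printed line and its space.
print(percent_modulation([30, 200, 210, 40]))
```

As spray and scatter fill the white space, the maximum intensity falls toward the minimum and the ratio tends to zero.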



Color patterns were illuminated under filtered light to 
increase the contrast while keeping reflected intensities 
within the range of 0 to 255 as measured by the video
camera. Therefore, all values reported for the colored 
samples are relative and not absolute. Data is reported 
here for only one of the three color channels. A complete
analysis would include all three channels. However, we
found no case in this study where the inclusion of the
other two channels altered a result. This data is normalized
and presented as percent modulation on the plots.

By generating rays starting at the center of the test
target, line profiles can be taken through as many print
angles as desired, for complete analysis of the test
pattern. In this work, 10 rays were taken from the center
to the edge of the target, in the fourth quadrant, at
10-degree increments starting with the vertical axis and
ending with the horizontal axis. The CTF data for all ten
rays was computed at the desired frequencies and averaged
to obtain the percent modulation as a function of
spatial frequency for the sample. The data reported here
was all obtained by this method and represents an
average of the CTFs at the ten print angles.
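
The angle sweep and averaging step can be sketched as follows. This assumes each ray has already been reduced to a list of CTF values at a common set of frequencies; the function names and the numbers are illustrative only.

```python
def ray_angles(step_deg=10, start_deg=0, stop_deg=90):
    """Print angles from the vertical axis (0) to the horizontal axis (90)."""
    return list(range(start_deg, stop_deg + 1, step_deg))


def average_ctf(ctf_by_ray):
    """Average the CTF over all rays, frequency by frequency."""
    count = len(ctf_by_ray)
    return [sum(values) / count for values in zip(*ctf_by_ray)]


angles = ray_angles()
print(len(angles))  # ten rays at 10-degree increments

# Two hypothetical rays, each sampled at the same two frequencies.
print(average_ctf([[1.0, 0.5], [0.0, 0.5]]))
```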

Using in-house software, text and "standard" images were
transformed into the spatial frequency domain. The
standard images were printed on a 2400-dpi imagesetter
using scaled bit maps otherwise identical to the test
image. Differences between the sample and the standard
Fourier transforms were computed and the dropout
frequencies noted. These correspond to mathematical
notch filters applied to the standard at certain frequencies.
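
The idea of a dropout frequency can be illustrated in one dimension. The sketch below is not the in-house software: it uses a plain discrete Fourier transform on an intensity profile, and it flags a frequency bin as a dropout when the standard has significant energy there but the sample has much less. The thresholds `floor` and `ratio` are arbitrary assumptions.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the non-negative frequency bins, scaled by 1/N."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(total) / n)
    return mags


def dropout_bins(standard, sample, floor=0.05, ratio=0.5):
    """Bins present in the standard but strongly attenuated in the sample."""
    std = dft_magnitudes(standard)
    smp = dft_magnitudes(sample)
    return [k for k, (a, b) in enumerate(zip(std, smp))
            if a > floor and b < ratio * a]


n = 32
standard = [math.sin(2 * math.pi * 2 * t / n) + math.sin(2 * math.pi * 5 * t / n)
            for t in range(n)]
sample = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]  # bin 5 is lost
print(dropout_bins(standard, sample))
```

A two-dimensional image would need a 2D transform, but the comparison step is the same.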

The test image can then be reconstructed by adding the 
dropout frequencies one at a time to identify which 
frequencies are responsible for which print defects. The
defect frequencies can sometimes be attributed to printer
functions, such as gear noise or mechanical vibration
frequencies. In this case, a new engine design or materials
set will be required to correct the printed image.

Dropout frequencies associated with the sampling frequency
of the print grid (i.e., dpi) cannot be corrected
without changing the resolution, and thus represent a
fundamental limitation. These frequencies can be filled in
by various resolution enhancement techniques, or the
resolution of the printer must be increased. One application
of the Fourier transform method is the immediate
evaluation of various resolution enhancement techniques. 

Human response to print quality was determined by a 
committee of 14 trained observers. The committee was
shown samples consisting of merged text and graphics for
which they graded solid fill homogeneity, contrast, edge
roughness, edge sharpness, line resolution, and character
density. The committee also gave overall subjective
impressions of each sample at the page level, and ranked 
the samples by making paired comparisons. 

Results 

Effect of Print Algorithm. A number of different algorithms
were examined for preparing the concentric circle test
target. Bresenham's algorithm generates the pattern very
quickly, but snaps all radii to the nearest integer value.
Several PostScript interpreters were evaluated; some have
floating-point accuracy, others gave unusual renderings.
There were considerable differences among them. The
test target can also be calculated in polar fashion by
incrementing an angle from 0 to 360 degrees and using
the sine and cosine of the angle to calculate points to
render directly to a bit map. This makes it possible to
use the target for print engine and dpi tests.

Fig. 2. Test target output from a 300-dpi electrographic printer
(printer P) with a good match between dot size and resolution.

The method used for the rest of this investigation
generates a bit map by using a square root function to
generate the circles:

$$Y = \mathrm{INT}\left(\sqrt{\mathrm{Radius}^2 - X^2}\right)$$

for integer values of X. This is computed in one octant of
the circle and reflected to the others. Fig. 1 is a 2400-dpi
Linotronic output of the test pattern. Differences in print
quality arising from the print algorithm alone could have
been evaluated using the CTF method outlined in this
paper. The choice of an algorithm for printing the concentric
circle pattern was based on subjective and qualitative
visual ranking of the geometric integrity of test patterns
generated as described here.
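
The square-root method can be sketched for a single circle. This is an illustration under stated assumptions, not the authors' program: the radius is an integer number of pixels, INT is taken to mean truncation, and the octant is the one where X does not exceed Y. The function names are hypothetical.

```python
import math


def octant_points(radius):
    """Points (x, y) with y = INT(sqrt(radius^2 - x^2)) in the octant x <= y."""
    points = []
    for x in range(radius + 1):
        y = math.isqrt(radius * radius - x * x)  # integer (truncated) square root
        if x > y:
            break  # past the 45-degree line; the rest comes from reflection
        points.append((x, y))
    return points


def circle_points(radius):
    """Reflect the octant into all eight octants of the circle."""
    pixels = set()
    for x, y in octant_points(radius):
        for px, py in ((x, y), (y, x)):
            for sx in (1, -1):
                for sy in (1, -1):
                    pixels.add((sx * px, sy * py))
    return pixels


print(octant_points(5))
print(len(circle_points(5)))
```

Drawing the full test target would repeat this for a sequence of radii whose spacing decreases from the center outward.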

Effect of DPI, Dot Size, and Edge Smoothing. Increasing the
dpi of a printer results in improved CTF through a wider
range of spatial frequencies, provided dot size is reduced
accordingly. If dot size is held constant, only low-frequency
response is improved. Fig. 2 is from a 300-dpi electrographic
printer (coded printer P) that has a reasonable
match between dot size and resolution. Features are
visually distinguished out to the Nyquist frequency, about 6
cycles/mm.

A second 300-dpi printer, coded printer U, has a larger
dot size than printer R. Comparison of Fig. 3 with Fig. 2
shows loss of contrast at higher spatial frequencies. It has
been calculated that a severe degradation of the MTF
results from even a 5% increase in dot size.6

Fig. 3. Test target output from a 300-dpi electrographic printer
with a larger dot size.

Fig. 4 is from printer R with an edge smoothing algorithm
applied, and shows improvement at low and middle
frequencies. At high frequencies, however, there is actually
loss of contrast as the white space between lines is
increasingly encroached upon by the thicker, smoothed
lines. The main advantage of this particular edge smoothing
technique lies in the low to middle frequency regions
where most text information is located. When the print
quality jury evaluated text only, they unanimously preferred
the edge smoothing algorithm. However, when fine
lines, 50% fill, and other graphics patterns were included
in the samples, the overall preference was in favor of the
unsmoothed samples. Figs. 5 and 6 are enlargements
of text samples shown to the jury of smoothed and
unsmoothed print.

Fig. 4. Test target from printer R with an edge smoothing
algorithm applied.

Fig. 5. Enlargement of text without edge smoothing.

Figs. 7 and 8 are from 400-dpi and 600-dpi printers,
respectively. The moire centers are observed to occur at
locations on the vertical and horizontal axes corresponding
to the dpi value in dots per inch divided by integers
and half-integers. As resolution in dpi is improved, the
moire patterns have fewer discernible rings and appear at
higher frequencies. Print fidelity is therefore better
through a broader range of spatial frequencies. Fig. 9 is a
plot of the normalized contrast, as percent modulation,
for three printers with 300-dpi, 600-dpi, and 1200-dpi
resolution. In general, the moire patterns are evidence of
print defects, and measures taken to reduce their visibility
in the test target will result in improved fidelity for
both text and graphics.

Fig. 7. Test target output from a 400-dpi electrographic printer.

Effect of Toner. Toner particle size can have a measurable
impact on print quality,7 and the CTF method can be
used to evaluate this effect. Two special toners were
compared with a standard commercially available toner.
The special toners were characterized by having smaller
average particle size and a narrower particle size distribution.
A comparison of Figs. 10 and 2 shows the impact of
this on the concentric circle test pattern. The special





Fig. 6. Enlargement of 300-dpi text (4-point) from printer R with
an edge smoothing algorithm applied.

Fig. 8. Test target output from a 600-dpi electrographic printer.

Table I

Preferred Graphics Samples 

Results of Human vs Machine-Graded Tests 



Fig. 9. Percent modulation as a function of spatial frequency
(cycles/mm) for 300-dpi, 600-dpi, and 1200-dpi printers.

toners give smoother line edges, less scatter, and consequently
better contrast. The CTF plots in Fig. 11 illustrate
the impact of this over the spatial frequency range. The
curve for the special toner remains high through the
human sensitivity range. The print quality jury invariably
preferred the special toners to the standard toner.

Color Samples. Color print samples of the concentric circle
pattern were generated by 300-dpi and 400-dpi
color printers. Data shown here is for green print samples.
The choice of a secondary color introduces the
added parameter of color-to-color registration, which can
be separately evaluated by the method. The difference in
resolution and the wider stroke width of the 300-dpi
printer combined to make the 400-dpi printer clearly
superior. Text and graphics samples judged by the print
quality jury followed the same order of preference. When
the 300-dpi printer had its stroke width modified by
deleting one pixel width every line, it became the better
of the two printers, according to the CTF data. This is
illustrated in Fig. 12. Human evaluation gave the same
result.

CTF Analysis Compared to Human Perception. Five black-and-white
printers and two color printers were evaluated by
CTF analysis and print quality jury evaluation. Two CTF
methods were compared to human perception. The first
was the quick method covering five frequencies and seven
print angles, which measured average optical density to
approximate the contrast function. This narrow-range
method has the advantages of simplicity and speed, and
is adequate for many applications. In addition, it has
correlation with the print quality jury findings, approximately
83% for pairwise comparisons. This data is presented
in Table I.



Sample  Parameter        Printer  Print Quality  Quick  Concentric
Set     Tested                    Jury           CTF    Circle Target

1-2     Edge Smoothing   Q        1              1      1
1-8     Toners           Q        1              8      1
3-4     Toners           P        3              3      3
5-6     Dot Size         Q, R     6              6      6
5-8     Edge Smoothing   Q        8              8      8
6-7     Toners           R        7              7      7

"Sample set" refers to the code numbers of the print
conditions being compared. The numbers under the
method headings are the preferred sample in each set. In
the case of the two CTF methods, the area under the
CTF curve is the figure of merit used to predict the
preferred sample.

The print quality jury consisted of 14 trained observers.
The quick CTF test used only 5 spatial frequencies from
0.85 to 2.0 cycles per mm, and only 7 angles of print axis.
The concentric circle target used frequencies from 0 to
SJ cycles per mm, and 10 angles of print axis. Graphics
only are considered in this test set. For the machine-graded
tests, the integral under the CTF curves was used
as a figure of merit to determine which sample was
better. The preferred sample in each two-sample set is
listed by number in Table I.




Fig. 10. Test target output from a 300-dpi printer.

Fig. 11. Plot of percent modulation as a function of spatial frequency
(cycles/mm) for printer P. Sample 3 was printed with a special toner and
Sample 4 was printed with standard toner.

The concentric circle target method is much more time
and labor intensive, but has 100% correlation with the
print quality jury for this data set. Since it covers a
broader frequency range and more print angles, it does
distinguish print fidelity more completely. Paired comparison
of samples 8 and 1 (Fig. 13) illustrates this advantage.
The quick CTF method predicts that sample 8 is
better than sample 1. In the same frequency range, the
concentric circle method shows slightly better contrast
for sample 1. However, at higher frequencies, the concentric
circle pattern reveals significantly better performance
for sample 1. The print quality jury preferred sample 1.
The frequencies through which sample 1 outperforms
sample 8 are within human perception, and apparently
correlate with factors that influenced the committee.



Fig. 13. Plot of percent modulation as a function of spatial frequency
(cycles/mm) for paired comparison of samples 8 and 1. Sample 8 is from
printer Q with the edge smoothing algorithm turned off and standard
toner. Sample 1 is the same except that special toner was
used.

A comparison test of sample 1 against sample 2 also 
shows this effect (Fig. 14). Based on the magnitude of 
the integral under the CTF curves, the quick method 
shows a very slight difference between the samples with
1 better than 2. The concentric circle method, in the
same range, also gives sample 1 a very slight edge, but in
the higher-frequency region, sample 1 distinctly outperforms
sample 2. The print quality jury overwhelmingly
preferred sample 1. Apparently, this frequency region is
important to human print quality evaluation and should be
included in machine-graded tests if the increased likeli- 
hood of correlation with human perception justifies the 
increased time for the test. 







Fig. 12. Plot of percent modulation as a function of spatial frequency
(cycles/mm) for three color (green) test plots. The test pattern for data
curve "E" was printed with a 400-dpi color printer. The test pattern
for data curve "Q" was printed with a 300-dpi color printer. Curve
"Q adj. linewidth" is for the 300-dpi printer with a narrower linewidth.

Fig. 14. Plot of percent modulation as a function of spatial frequency
(cycles/mm) for paired comparison of samples 1 and 2. Sample 1 is from
printer Q with special toner and resolution enhancement technique
off. Sample 2 is the same except that an edge smoothing algorithm
is applied.

Fig. 15. Enlarged 12-point text showing a 2400-dpi original sample, a
300-dpi original, and the same 2400-dpi sample after being Fourier
transformed, filtered, and reconstructed. Panels: 2400 dpi Image,
300 dpi Image, Reconstructed Image.



Fourier Transform Results. In Fig. 15, three images are
compared. The first is a 2400-dpi image, which has been
chosen to represent an "ideal" image. The second is
300-dpi output of the same bit map which has been
scaled to accommodate the change in addressability. The
third is the same 2400-dpi image which has been transformed,
filtered, and reconstructed to resemble the
300-dpi image. The filter notched the Fourier transform to
approximate the frequency limitations of the 300-dpi
printer. Mathematical addition of some of the spatial
frequency components back into the notched Fourier
transform, with subsequent inverse transformation, shows
which frequencies are responsible for which print defects.
When the source of the frequency dropout is identified, it
can either be corrected or accepted as a fundamental
limitation on printer performance. The transforms of the
two images may also be subtracted from each other, with 
the difference corresponding directly to spatial frequency 
limitations of the 300-dpi printer. 

Conclusions 

CTF methods, applied here in pairwise comparisons,
differentiated between algorithms, dot sizes, stroke
widths, dpi, edge smoothing, and toners. In addition, the
method shows whether system changes will be expected 
to improve text, graphics, neither, or both, based on the 
spatial region in which the CTF response is altered. 

The Fourier transform method is useful for identifying 
spatial frequencies that affect various image characteristics.
It also demonstrates usefulness for predicting where



the fundamental limitations of the printer have been 
reached. This will have an impact on engine design. 

In all comparisons of printed samples, the results corresponded
to the overall subjective preferences of a trained
print quality panel. From this it is concluded that this 
method shows promise as an automated print quality 
analysis technique, with application to both black and 
white and color printers.

Parallel Raytraced Image Generation

Simulations of an experimental parallel processor architecture have 
demonstrated that four processors can provide a threefold improvement in 
raytraced image rendering speed compared to sequential rendering. 

by Susan S. Spach and Ronald W. Pulleyblank



Computer graphics rendering is the synthesis of an image
from a mathematical model of an object contained in a
computer. This synthesized image is a two-dimensional
rendering of the three-dimensional object. It is created by
calculating and displaying the color of every controllable
point on the graphics display screen. A typical display
contains a grid of 1280 by 1024 of these controllable
points, which are called pixels (picture elements). The
memory used to store pixel colors is called the frame
buffer. Specialized hardware accelerators available on
today's workstations, such as HP's Turbo SRX and Turbo
VRX products,1 can render models composed of polygons
in real time. This makes it possible for the user to alter 
the model and see the results immediately. Real-time 
animation of computer models is also possible. 

The most time-consuming operation in rendering is the 
computation of the light arriving at the visible surface 
points of the object that correspond to the pixels on the 
screen. Real-time graphics accelerators do this by transforming
polygonalized objects in the model to a desired
position and view, calculating an illumination value at the
polygon vertices, projecting the objects onto a 2D plane
representing the display screen, and interpolating the
vertex colors to all the pixels within the resulting 2D 
polygons. This amounts to approximating the true surface 
illumination with a simplified direct lighting model. 

Direct lighting models only take into account the light 
sources that directly illuminate a surface point, while 
global illumination models attempt to account for the 
interchange of light between all surfaces in the scene. 
Global illumination models result in more accurate images
than direct lighting models. Images produced with global 
lighting models are often called photorealistic. 

Fig. 1 shows the contrast between hardware shading and
photorealistic renderings. Fig. 1a was computed using a
local illumination model while Figs. 1b, 1c, and 1d were
computed using global illumination algorithms. 

The disadvantage of photorealistic renderings is that they 
are computationally intensive tasks requiring minutes for 
simple models and hours for complex models.

Raytracing is one photorealistic rendering technique that
generates images containing shadows, reflections, and 
transparencies. Raytracing is used in many graphics 
applications including computer-aided design, scientific 



visualization, and computer animation. It is also used as a 
tool for solving problems in geometric algorithms such as 
evaluation of constructive solid geometry models and 
geometric form factors for radiative energy transfer.

The goal of our research is to develop parallel raytracing 
techniques that render large data models in the fastest 
possible times. Our parallel raytracing techniques are 
being implemented to run on the Image Compute Engine 
(ICE) architecture. ICE, under development in our project
group at HP Laboratories, is a multiprocessor system 
intended to accelerate a variety of graphics and image 
processing applications. ICE consists of clusters of 
floating-point processing elements, each cluster containing
four processors with local and shared memory. The
clusters are networked using message passing links and 
the system topology is configured using a crossbar 
switch. A prototype system of eight clusters is under 
construction. Data distribution, load balancing, and 
algorithms possessing a good balance between computa- 
tion and message passing are research topics in our 
parallel implementation.

Raytracing Overview 

Generation of synthetic images using the raytracing 
technique was introduced by Appel2 and MAGI3 in 1968
and then extended by Whitted in 1980.4 Raytracing is a
method for computing global illumination models. It
determines surface visibilities, computes shading for
direct illumination, and computes an approximation to the 
global illumination problem by calculating reflections, 
refractions, and shadows. The algorithm traces simulated 
light rays throughout a scene of objects. The set of rays 
reaching the view position is used to calculate the illu- 
mination values for the screen pixels. These rays are 
traced backwards from the center of projection through 
the viewing plane into the environment. This approach
makes it unnecessary to compute all the rays (an infinite
number) in the scene. Only a finite number of rays
needed for viewing are computed. 

Fig. 1. (a) A scene computed using a local illumination model.
(b), (c), (d) Renderings computed using global illumination
algorithms.

An observer view position (the center of projection or
"eye" position) and a view plane are specified by the user
(Fig. 2). The raytracer begins by dividing a window on
the view plane into a rectangular grid of points that
correspond to pixels on the screen and then proceeds to
determine the visibility of surfaces. For each pixel an eye
ray is traced from the center of projection through the
pixel out into the scene environment. The closest ray/object
intersection is the visible point to be displayed at
that pixel. For each visible light source in the scene, the 
direct illumination is computed at the point of intersec- 
tion using surface physics equations. The resulting illu- 
mination value contributes to the value of the color of 
the pixel. These rays are referred to as primary rays. 
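
The visibility step for a primary ray can be sketched with spheres as the only primitive. This is a minimal illustration, not the ICE implementation: it assumes unit-length ray directions, represents each sphere as a (center, radius) pair, and uses hypothetical function names.

```python
import math


def intersect_sphere(origin, direction, center, radius):
    """Smallest positive distance t along a unit-direction ray, or None."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None  # the ray misses the sphere
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 1e-9:  # ignore hits behind or at the ray origin
            return t
    return None


def closest_hit(origin, direction, spheres):
    """Return (distance, sphere index) of the nearest intersection, or None."""
    best = None
    for index, (center, radius) in enumerate(spheres):
        t = intersect_sphere(origin, direction, center, radius)
        if t is not None and (best is None or t < best[0]):
            best = (t, index)
    return best


eye = (0.0, 0.0, 0.0)
spheres = [((0.0, 0.0, 10.0), 1.0), ((0.0, 0.0, 5.0), 1.0)]
print(closest_hit(eye, (0.0, 0.0, 1.0), spheres))
```

Testing every object for every ray, as this sketch does, is the cost that the data structures discussed later are meant to reduce.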




Fig. 2. The raytracing technique traces rays of light from the
viewer's eye (center of projection) to objects in the scene.



The raytracing algorithm proceeds to calculate whether or
not a point is in shadow. A point is not in shadow if the
point is visible from the light source. This is determined
by sending a ray from the point of intersection to the
light source. If the ray intersects an opaque object on the
way, the point is in shadow and the contribution of the
shadow ray's light source to the surface illumination is
not computed. However, if no objects intersect the ray,
the point is visible to the light source and the light
contribution is computed. Fig. 3a illustrates shadow
processing. The point on the sphere surface receives light
from light source A, but not from light source B.

A ray leaving the surface toward the view position has
three components: diffuse reflection, specular reflection,
and a transmitted component. Specular and transmitted
rays are determined by the direction of the incoming ray
and the laws of reflection and refraction. The light
emitted by these rays is computed in the same manner as
the primary ray and contributes to the pixel corresponding
to the primary ray. Figs. 3b and 3c show the reflection
and transmitted rays of several objects in a scene.
Diffuse reflection (the scattering of light equally in all
directions) is approximated by a constant value. Accurate
computation of the diffuse component requires the solving
of energy balance equations as is done in the radiosity
rendering algorithm. Diffuse interreflections can also
be approximated using raytracing techniques, but this
requires excessive computation.

Fig. 3. Types of rays. (a) Shadow. (b) Reflection. (c) Refraction.

The raytracing algorithm is applied recursively at each 
intersection point to generate new shadow, reflection, and 
refraction rays. Fig. 4 shows the light rays for an environ- 
ment. The rays form a ray tree as shown in Fig. 5. The 
nodes represent illumination values and the branches 
include all secondary rays generated from the primary 
ray. Conceptually, the tree is evaluated in bottom-up order 
with the parent's node value being a function of its 
children's illumination. The weighted sum of all the node 
colors defines the color of a pixel. A user-defined maximum
tree depth is commonly used to limit the size of the
tree. It is evident from Figs. 4 and 5 that shadow rays
dominate the total ray distribution. 
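
A small Python sketch makes the claim about shadow rays concrete. It counts the secondary rays in a full ray tree under a simplifying assumption that is not from the article: every ray hits a surface that both reflects and transmits. The function name count_rays is hypothetical.

```python
def count_rays(max_depth, num_lights, depth=1):
    """Count secondary rays below one intersection point of a full ray tree."""
    # Every intersection point launches one shadow ray per light source.
    counts = {"shadow": num_lights, "reflection": 0, "refraction": 0}
    if depth < max_depth:
        for kind in ("reflection", "refraction"):
            counts[kind] += 1
            child = count_rays(max_depth, num_lights, depth + 1)
            for name, value in child.items():
                counts[name] += value
    return counts

print(count_rays(max_depth=3, num_lights=2))
# {'shadow': 14, 'reflection': 3, 'refraction': 3}
```

With a tree depth of three and two lights, 14 of the 20 secondary rays are shadow rays, and the share grows with the number of light sources.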

The basic operations in raytracing consist of generating
rays and intersecting the rays with objects in the scene.
An advantage of raytracing is that it is easy to incorporate
many different types of primitives such as polygons,
spheres, cylinders, and more complex shapes such as
parametric surfaces and fractal surfaces. The only requirement
to be able to use an object type is that there be a
procedure for intersecting the object with a ray. One of
the main challenges in raytracing is making the ray
intersection operation efficient. Algorithmic techniques
have been developed that include faster ray-object intersections,
data structures to limit the number of ray-object
intersections, sampling techniques to generate fewer rays,
and faster hardware using distributed and parallel processing.10,11,12
Our research effort concentrates on using data
structures to limit the number of ray-object intersections
and on using parallel techniques to accelerate the overall
process.
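
The requirement that each object type supply its own intersection procedure can be pictured as a one-method interface. The following Python sketch is an assumption-laden illustration: the Primitive protocol, the Sphere and Plane classes, and closest_hit are hypothetical names and do not describe the authors' C implementation.

```python
import math
from typing import Optional, Protocol, Tuple

Vec = Tuple[float, float, float]
EPS = 1e-9

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

class Primitive(Protocol):
    def intersect(self, origin: Vec, direction: Vec) -> Optional[float]:
        """Return the ray parameter t of the nearest hit, or None."""

class Sphere:
    def __init__(self, center, radius):
        self.center, self.radius = center, radius

    def intersect(self, origin, direction):
        oc = sub(origin, self.center)
        a = dot(direction, direction)
        b = 2.0 * dot(direction, oc)
        c = dot(oc, oc) - self.radius ** 2
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return None
        root = math.sqrt(disc)
        for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
            if t > EPS:
                return t
        return None

class Plane:
    def __init__(self, point, normal):
        self.point, self.normal = point, normal

    def intersect(self, origin, direction):
        denom = dot(self.normal, direction)
        if abs(denom) < EPS:  # ray parallel to the plane
            return None
        t = dot(self.normal, sub(self.point, origin)) / denom
        return t if t > EPS else None

def closest_hit(origin, direction, primitives):
    """Intersect the ray with every primitive and keep the nearest hit."""
    best = None
    for prim in primitives:
        t = prim.intersect(origin, direction)
        if t is not None and (best is None or t < best[0]):
            best = (t, prim)
    return best

scene = [Sphere((0.0, 0.0, 5.0), 1.0), Plane((0.0, 0.0, 10.0), (0.0, 0.0, -1.0))]
t, prim = closest_hit((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene)
print(t, type(prim).__name__)  # 4.0 Sphere
```

The raytracer's main loop never needs to know which shape it is testing, so a new primitive type is added by writing one more intersect method.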

Spatial Subdivision 

Spatial subdivision data structures are one way to help 
limit the number of intersections by selecting relevant 
objects along the path of a ray as good candidates for ray 
intersection. Spatial subdivision methods partition a 
volume of space bounding the scene into smaller vol- 
umes, called voxels. Each voxel contains a list of objects 
wholly or partially within that voxel. This structuring 
yields a three-dimensional sort of the objects and allows 
the objects to be accessed in order along the ray path. 

We employ a spatial subdivision technique termed the 
hierarchical uniform grid 13 as the method of choice. This 
approach divides the world cube bounding the scene into
a uniform three-dimensional grid with each voxel containing
a list of the objects within the voxel (Fig. 6a). If a
voxel contains too many objects, it is subdivided into a 
uniform grid of its own (Fig. 6b). Areas of the scene that 
are more populated are more finely subdivided, resulting 
in a hierarchy of grids that adapts to local scene com- 
plexities. 
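
A minimal Python sketch of this construction follows. It makes several assumptions that are not from the article: objects are represented by axis-aligned bounding boxes, the grid resolution and the subdivision threshold max_objects are arbitrary, and a max_depth limit is added to guarantee termination. The names Grid and overlaps are hypothetical.

```python
def overlaps(box, vlo, vhi):
    """True if an axis-aligned box (lo, hi) overlaps the voxel from vlo to vhi."""
    blo, bhi = box
    return all(blo[a] < vhi[a] and bhi[a] > vlo[a] for a in range(3))

class Grid:
    """Uniform grid whose crowded voxels are replaced by nested grids."""

    def __init__(self, lo, size, boxes, ids, res=2, max_objects=2,
                 depth=0, max_depth=3):
        self.lo, self.size, self.res = lo, size, res
        cell = size / res
        self.voxels = {}  # (i, j, k) -> list of object ids, or a nested Grid
        for i in range(res):
            for j in range(res):
                for k in range(res):
                    vlo = (lo[0] + i * cell, lo[1] + j * cell, lo[2] + k * cell)
                    vhi = tuple(c + cell for c in vlo)
                    inside = [n for n in ids if overlaps(boxes[n], vlo, vhi)]
                    if len(inside) > max_objects and depth < max_depth:
                        # Too many objects: give this voxel a grid of its own.
                        self.voxels[(i, j, k)] = Grid(
                            vlo, cell, boxes, inside, res,
                            max_objects, depth + 1, max_depth)
                    elif inside:
                        self.voxels[(i, j, k)] = inside

boxes = [
    ((0.5, 0.5, 0.5), (1.0, 1.0, 1.0)),
    ((1.2, 1.2, 1.2), (1.8, 1.8, 1.8)),
    ((2.5, 2.5, 2.5), (3.0, 3.0, 3.0)),
    ((6.0, 6.0, 6.0), (7.0, 7.0, 7.0)),
]
world = Grid((0.0, 0.0, 0.0), 8.0, boxes, range(len(boxes)))
print(type(world.voxels[(0, 0, 0)]).__name__)  # Grid: the crowded corner was subdivided
print(world.voxels[(1, 1, 1)])                 # [3]
```

Empty voxels are simply absent from the dictionary, so sparse regions of the scene cost nothing to store.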

The hierarchical uniform grid is used by the raytracer to
choose which objects to intersect. We find the voxel in
the grid that is first intersected by the ray. If that voxel
contains objects, we intersect the ray with those objects.
If one or more intersections occur within the voxel, the
closest intersection to the ray origin is the visible point
and secondary rays are spawned. If there are no intersections
or if the voxel is empty, we traverse the grid to the
next voxel and intersect the objects in the new voxel
(Fig. 7a). The ray procedure ends if we exit the grid,
indicating that no object in the scene intersects the ray
(Fig. 7b).

Fig. 4. Light sources, objects, and rays for an environment.

Fig. 5. Ray tree for the environment of Fig. 4.

Fig. 6. The hierarchical uniform grid spatial subdivision technique.
(a) The world cube surrounding the scene is divided into a uniform
three-dimensional grid of volumes called voxels. (b) If a voxel
contains too many objects, it is subdivided into a uniform grid of
its own.

Grid traversal is fast because the grid is uniform, allowing
the use of an incremental algorithm for traversing from
voxel to voxel. There is a penalty for moving up and
down the hierarchy to different grids, but this is the cost
of having the data structure adapt to the scene
complexity.
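
The article does not say which incremental algorithm is used. One widely known scheme of this kind is the voxel-stepping method of Amanatides and Woo, sketched below in Python for a single grid level. The function name traverse, the grid origin at zero, and the requirement that the ray start inside the grid with a nonzero direction are assumptions of this example.

```python
import math

def traverse(origin, direction, res, cell=1.0):
    """Yield voxel indices (i, j, k) in the order the ray passes through them."""
    idx = [int(origin[a] // cell) for a in range(3)]
    step, t_max, t_delta = [], [], []
    for a in range(3):
        d = direction[a]
        if d > 0:
            step.append(1)
            t_max.append(((idx[a] + 1) * cell - origin[a]) / d)
            t_delta.append(cell / d)
        elif d < 0:
            step.append(-1)
            t_max.append((idx[a] * cell - origin[a]) / d)
            t_delta.append(-cell / d)
        else:
            step.append(0)            # the ray never crosses a boundary on this axis
            t_max.append(math.inf)
            t_delta.append(math.inf)
    while all(0 <= idx[a] < res for a in range(3)):
        yield tuple(idx)
        a = t_max.index(min(t_max))   # axis whose voxel boundary is reached first
        idx[a] += step[a]
        t_max[a] += t_delta[a]

path = list(traverse((0.5, 0.5, 0.5), (1.0, 0.5, 0.0), res=4))
print(path)
# [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0), (3, 1, 0), (3, 2, 0)]
```

Each step costs one comparison and one addition per voxel, which is why uniform grids are cheap to walk. In a hierarchy, entering a subdivided voxel would restart the same loop on the nested grid.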

Adjacent voxels are likely to contain the same object
because objects may overlap several voxels. Two critical
implementation details are included to avoid erroneous
results and repeated ray intersections. First, the intersection
point of an object must occur within the current
voxel. Second, intersection records, containing the ID of
the last ray intersected with that object and the result,
are stored with the object to prevent repeated intersection
calculations of the same ray with the same object as
the ray traverses the grid.
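
Both details can be seen in a short Python sketch. The intersection routine is replaced by a stored value so that the bookkeeping stands out. The class and function names are hypothetical, and the voxel list is assumed to arrive already sorted along the ray.

```python
class MailboxedObject:
    """Scene object carrying an intersection record (last ray ID and result)."""

    def __init__(self, name, hit_t):
        self.name = name
        self.hit_t = hit_t          # stand-in for a real intersection routine
        self.last_ray_id = None
        self.last_result = None
        self.tests = 0              # number of real intersection calculations

    def intersect(self, ray_id):
        if self.last_ray_id == ray_id:
            return self.last_result     # same ray seen before: reuse the result
        self.tests += 1
        self.last_ray_id = ray_id
        self.last_result = self.hit_t
        return self.last_result

def first_hit(ray_id, voxels):
    """voxels: list of (t_enter, t_exit, objects) in order along the ray."""
    for t_enter, t_exit, objects in voxels:
        best = None
        for obj in objects:
            t = obj.intersect(ray_id)
            # Accept only hits that lie inside the current voxel.
            if t is not None and t_enter <= t <= t_exit:
                if best is None or t < best[0]:
                    best = (t, obj)
        if best is not None:
            return best
    return None

big = MailboxedObject("big", hit_t=1.5)      # overlaps both voxels, hit lies in the second
small = MailboxedObject("small", hit_t=None)
t, obj = first_hit(7, [(0.0, 1.0, [big, small]), (1.0, 2.0, [big])])
print(t, obj.name, big.tests)  # 1.5 big 1
```

The hit on the large object is rejected in the first voxel because it lies outside that voxel, then accepted in the second without a second intersection calculation.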

ICE Overview 

ICE is a parallel architecture composed of clusters of
floating-point processors. Each cluster has four processors
and roughly 64M bytes of shared memory accessible
for reading and writing by all four processors. Each
processor has 4M bytes of local memory, which is used
to hold private data and program code. The clusters
communicate using message passing links and the system
topology is configurable with a crossbar switch. There is
a data path from the common display buffer to the cluster's
shared memory. Fig. 8 shows the ICE architecture.

Fig. 7. Hierarchical uniform grid traversal.

Each shared memory can be configured to hold frame
buffer data and/or can be used to hold data accessible by
all four processors. The frame buffer data can be configured
as a complete 1280-by-1024 double buffered frame
buffer of RGB and Z values or a rectangular block subset
of the frame buffer. The message passing links are used
for intercluster communication and access to common
resources such as a disk array. The host workstation can
broadcast into all local and shared memories via the
message passing links.

The frame buffers in each cluster's shared memory are
connected by custom compositing hardware to a double
buffered display frame buffer which is connected to a
monitor. The compositing hardware removes the bottleneck
from the cluster frame buffers to the single
display frame buffer. The compositing hardware can
function in three different modes: Z buffer mode, alpha
blend mode, and screen space subdivision mode.

In Z buffer mode, the compositing hardware simulta- 
neously accesses the same pixel in all live cluster frame 
buffers, selects the one with the smallest Z, and stores it
in the display buffer. This mode is used for Z-buffered
polygon scan conversion.

In alpha blend mode the same pixel is simultaneously
accessed in all cluster frame buffers. Pixels from adjacent
data blocks are sorted into nearest and farthest and
blended using the blending rule
$\alpha \times \text{nearest} + (1 - \alpha) \times \text{farthest}$.
The final result is a blend of pixels from all
the clusters and is presented to the display buffer. This
mode is used in volumetric rendering of sampled data.

Fig. 8. Image Compute Engine (ICE) architecture.

In screen space subdivision mode, each cluster contains
pixels from a subset of the screen and the compositing
hardware simply gathers the pixels from the appropriate
cluster. This mode is used in raytracing applications.
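
The three compositing modes can be modeled in software for a single pixel, as in the Python sketch below. It is illustrative only and does not describe the compositing hardware. Colors are single gray values, the farthest pixel in alpha blend mode is treated as an opaque background, the pairwise blending rule is applied back to front across all clusters, and the block ownership rule in screen space mode is invented for the example.

```python
def composite_z(cluster_pixels):
    """Z buffer mode: cluster_pixels is a list of (color, z); smallest z wins."""
    return min(cluster_pixels, key=lambda p: p[1])[0]

def composite_alpha(cluster_pixels):
    """Alpha blend mode: cluster_pixels is a list of (color, alpha, z)."""
    ordered = sorted(cluster_pixels, key=lambda p: p[2], reverse=True)  # farthest first
    result = ordered[0][0]
    for color, alpha, _ in ordered[1:]:
        # alpha * nearest + (1 - alpha) * farthest, applied back to front
        result = alpha * color + (1.0 - alpha) * result
    return result

def composite_screen(x, y, block, num_clusters, cluster_buffers):
    """Screen space subdivision mode: fetch the pixel from the cluster owning it."""
    owner = ((x // block) + (y // block)) % num_clusters  # assumed ownership rule
    return cluster_buffers[owner][(x, y)]

print(composite_z([(0.2, 5.0), (0.9, 1.5), (0.4, 3.0)]))        # 0.9
print(composite_alpha([(1.0, 0.5, 1.0), (0.0, 1.0, 4.0)]))      # 0.5
buffers = [{(0, 0): 0.1}, {(8, 0): 0.7}]
print(composite_screen(8, 0, 8, 2, buffers))                    # 0.7
```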

Parallel Raytracing on ICE 

Raytracing is well suited for parallelization because the
task consists mainly of intersecting millions of independent
rays with objects in the model. Much research in
recent years has concentrated on using multiprocessors
to speed up the computation. Two approaches have
been used to partition the problem among the processors:
image space subdivision and object space subdivision.14,15,16,17,18,19

In image space subdivision, processor nodes (clusters) 
are allocated a subset of the rays to compute and the 
entire data set is stored at each node. While this method 
achieves almost linear speed increases, it is not a feasible 
solution for rendering data sets that require more memory 
than is available on a processing node. With object space 
methods, computations (rays) and object data are both 
distributed to processing nodes and coordination between 
them is accomplished through message passing between
clusters. We have chosen an object space subdivision 
approach for implementation on ICE because of its ability 
to handle very large data sets. 

Parallel object space subdivision is efficient if it results in
low interprocessor communication and low processor idle
time. As we partition the computation and object data,
several decisions need to be made. How are the object
data, ray computations, and frame buffer pixels distributed
among the processor nodes? How are ray computations
and the corresponding object data brought together?
How is load balancing accomplished?

The screen is subdivided into blocks of pixels which are
assigned to clusters where they are stored in the shared
memory. When the picture is complete these pixels are 
gathered into the display frame buffer by the custom 
compositing chips. 

The spatial subdivision grid data structure is stored at
every processing node. Voxels for which the data is not
stored locally are designated as remote voxels. The data
associated with the voxels in the grid data structure is
distributed among the clusters in a way that attempts to
statically balance the computational load among the
processor clusters. This is accomplished by grouping
adjacent voxels into blocks and distributing the blocks
among clusters so that each cluster contains many blocks
selected uniformly from throughout the model space.
Voxels distributed in this manner to a cluster are called
the primary voxels for that cluster. Voxels are distributed
in blocks to maintain coherence along a ray and reduce
intercluster communication (it is likely that the next
voxel will be on the same cluster for several iterations of
grid traversal).

The distribution of the voxels of a grid is performed for
all the grids in the hierarchy so that all portions of the
model that have a great deal of complexity are subdivided
into voxels and distributed throughout the network of
clusters. Thus, no matter where the viewpoint may be,
whether zoomed in or not, the objects in the view, and
thus the computational load, should be well distributed
among the processors.
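
One simple way to picture such a distribution is an interleaved assignment of voxel blocks to clusters, sketched below in Python. The article does not give the actual assignment rule, so the modular formula, the block size, and the name owner_cluster are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def owner_cluster(voxel, block_size, num_clusters):
    """Map voxel (i, j, k) to the cluster that holds it as a primary voxel."""
    bi, bj, bk = (c // block_size for c in voxel)   # block containing the voxel
    return (bi + bj + bk) % num_clusters            # interleave blocks over clusters

# An 8 x 8 x 8 grid, blocks of 2 x 2 x 2 voxels, four clusters.
load = Counter(owner_cluster(v, 2, 4) for v in product(range(8), repeat=3))
print(sorted(load.items()))  # [(0, 128), (1, 128), (2, 128), (3, 128)]
```

Voxels in the same block share an owner, which keeps a ray on one cluster for several traversal steps, while neighboring blocks go to different clusters so that every cluster holds pieces from all over the model space.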

When the data is distributed among processing nodes as 
it is for large data sets, and we trace a ray through our 
grid data structure, we may come to a voxel in which the 
data resides on a different processing node, that is, a 
remote voxel. At this point we send the ray information
to the processing node that contains the data and the
computation continues there.

If the data associated with the primary distribution of the 
data set does not fill up a cluster's shared memory, 
additional data, which duplicates data in another cluster, 
is added. The first additional data added is data used to 
speed up shadow testing. This is important because 
shadow rays are launched from every ray-object intersec- 
tion point towards every light source, creating congestion 
at voxels containing light sources. To alleviate this, the 
data in voxels nearest a light source that fall wholly or 
partially within the cones defined by the light source 
(cone vertex) and the cluster's primary voxels (cone 
bases) are added to the data stored within the cluster's
shared memory. If there is still space available in shared
memory after the shadow data is added, voxel data from
voxels adjacent to the cluster's primary voxel blocks is
added. If there is space enough to store the complete
data set in every cluster, that is done.

Each processor within a cluster maintains a workpool,
located in group shared memory, of jobs defined by either
a primary or a secondary ray. As new rays are formed
they are placed in a processor's workpool. When a
processor finds its workpool empty it takes jobs from its
neighbor's workpool. This organization is intended to
keep processors working on different parts of the database
to minimize group shared memory access conflicts.
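
The workpool scheme can be simulated sequentially in Python, as sketched below. The simulation takes turns among processors instead of running them concurrently, so it needs no locking. A real shared memory implementation would have to synchronize access to the pools. The names Processor, next_job, and run, and the choice to take a neighbor's job from the far end of its pool, are assumptions of this example.

```python
from collections import deque

class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.workpool = deque()   # jobs, each defined by a primary or secondary ray
        self.done = []

def next_job(procs, me):
    """Take a job from our own workpool, else from the nearest neighbor that has one."""
    if procs[me].workpool:
        return procs[me].workpool.popleft()
    n = len(procs)
    for offset in range(1, n):
        neighbor = procs[(me + offset) % n]
        if neighbor.workpool:
            return neighbor.workpool.pop()   # take from the opposite end
    return None

def run(procs):
    """Round-robin simulation: each processor handles one job per round."""
    active = True
    while active:
        active = False
        for me, proc in enumerate(procs):
            job = next_job(procs, me)
            if job is not None:
                proc.done.append(job)
                active = True

procs = [Processor(i) for i in range(4)]
procs[0].workpool.extend(f"ray{i}" for i in range(8))   # all work starts on one processor
run(procs)
print([len(p.done) for p in procs])  # [2, 2, 2, 2]
```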

Each cluster is responsible for determining which primary 
rays originate in its primary voxels and initializing its 
workpools accordingly. This can be done with knowledge
of the viewing parameters by projecting the faces of 
certain primary voxels (those on faces of the world cube 
facing the eye position) onto the screen and noting which 
pixels are covered. Jobs consisting of primary rays are 
listed as runs on scan lines to minimize the job creation 
time. 

A ray is taken from the workpool by a processor in the
cluster, which attempts to compute the remainder of the
ray tree. Any rays, primary or secondary, that cannot be
processed at a cluster because it does not contain the
necessary voxel and its associated model data are forwarded
to the cluster that contains the required voxel as
a primary voxel. A queue of rays is maintained for each
possible destination cluster; these are periodically bundled
into packets and sent out over the message passing links.
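
The per-destination queues might be organized as in the Python sketch below. The article says only that queued rays are bundled periodically. Here a packet is sent when a queue reaches a fixed size and again on an explicit flush, which is an assumption of this example, as are the class name RayForwarder and the send callback standing in for the message passing links.

```python
from collections import defaultdict

class RayForwarder:
    """Queue outgoing rays per destination cluster and send them in packets."""

    def __init__(self, packet_size, send):
        self.packet_size = packet_size
        self.send = send                  # callable(dest_cluster, list_of_rays)
        self.queues = defaultdict(list)

    def forward(self, dest_cluster, ray):
        queue = self.queues[dest_cluster]
        queue.append(ray)
        if len(queue) >= self.packet_size:
            self.flush(dest_cluster)

    def flush(self, dest_cluster):
        queue = self.queues[dest_cluster]
        if queue:
            self.send(dest_cluster, list(queue))
            queue.clear()

    def flush_all(self):
        for dest_cluster in list(self.queues):
            self.flush(dest_cluster)

sent = []
forwarder = RayForwarder(packet_size=3, send=lambda dest, rays: sent.append((dest, rays)))
for ray_id in range(4):
    forwarder.forward(2, ray_id)
forwarder.forward(1, 4)
forwarder.flush_all()
print(sent)  # [(2, [0, 1, 2]), (2, [3]), (1, [4])]
```

Bundling trades a little latency for fewer, larger messages on the links.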




Fig. 9. Three scenes used to measure the rendering speed
improvement of parallel processing over sequential processing.

Each ray includes information about what pixel its color
contribution must be accumulated in. These color contributions
of rays may be computed in any of the clusters
but the results are sent to the cluster that has responsibility
for the portion of the frame buffer containing that
pixel. There, the contributions are accumulated in the
pixel memory.

Raytracing completion is determined using a scoreboarding
technique. The host computer keeps a count of rays
created and rays completed. Clusters send a message to
the host when a ray is to be created, and the host
increments its count of rays created. Similarly, when a
ray completes in a cluster, the cluster tells the host and
the host increments its count of rays completed. When
these two counts are equal, the rendering job is done and
the compositing hardware, operating in screen space
subdivision mode, transfers all the frame buffer data from
each cluster group shared memory to the display frame
buffer.
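
The host's bookkeeping amounts to two counters, as the Python sketch below shows. The class and method names are hypothetical, and message delivery is replaced by direct method calls.

```python
class Scoreboard:
    """Host-side count of rays created and rays completed."""

    def __init__(self):
        self.created = 0
        self.completed = 0

    def ray_created(self, count=1):
        self.created += count

    def ray_completed(self, count=1):
        self.completed += count

    def done(self):
        # Rendering is finished when every created ray has completed.
        return self.created > 0 and self.created == self.completed

host = Scoreboard()
host.ray_created()          # a primary ray is about to be created
host.ray_created(3)         # its hit will spawn three secondary rays
host.ray_completed()        # the primary ray completes
print(host.done())          # False: three rays are still outstanding
host.ray_completed(3)
print(host.done())          # True
```

The ordering matters. Because creation is reported when a ray is about to be created, the host hears of the secondary rays before their parent completes, so the two counts cannot match while work is still pending.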

When static load balancing by uniform distribution of 
data among clusters and dynamic load balancing by 
commonly accessible workpools within clusters are 
inadequate, then dynamic load balancing between clusters 
is carried out. Our plan for accomplishing this is to
create workpools of rays for mutually exclusive blocks of
voxels in each cluster. Rays are placed on the voxel
workpool according to which voxel the ray has just
entered. These workpools are organized as a linked list.
Processors get a voxel workpool from the linked list for
processing. In this way, processors are working on
different regions of the data set, thereby reducing conten- 
tion for group shared memory. When a cluster exhausts 
its workpools it asks the other clusters for information on 
their workloads and requests from the busiest cluster one 
of its unused workpools together with its associated data. 
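
The planned exchange between clusters could look roughly like the Python sketch below. Several details are assumptions, since the article describes only a plan: workload is measured as the number of queued rays, the first workpool in each list is taken to be the one currently in use, and the transfer of the associated voxel data is omitted. The name rebalance is hypothetical.

```python
def rebalance(idle, clusters):
    """Move one unused voxel workpool from the busiest cluster to the idle one.

    clusters maps a cluster ID to a list of voxel workpools (lists of rays).
    Returns the ID of the donor cluster, or None if nothing could be moved.
    """
    others = [c for c in clusters if c != idle]
    busiest = max(others, key=lambda c: sum(len(pool) for pool in clusters[c]))
    pools = clusters[busiest]
    if len(pools) < 2:          # keep the pool the donor is working on
        return None
    clusters[idle].append(pools.pop())
    return busiest

clusters = {
    0: [],
    1: [["r"] * 3, ["r"]],
    2: [["r"], ["r"] * 5],
}
print(rebalance(0, clusters))                  # 2
print([len(pool) for pool in clusters[0]])     # [5]
```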

Results 

The ICE hardware, currently under construction, is 
expected to be completed in the spring of 1992. Parallel 
raytracing software in C has been written and simulations 
on an Apollo DN 10000 have been performed. The 
DN 10000 workstation has four processors and 128M bytes
of shared memory, similar to one cluster on ICE. 

The DN 10000 software includes a parallel programming 
toolset based on the Argonne National Laboratories 
macro set.20 This toolset includes macros for implementing
task creation and memory synchronization. Our
simulation is of one cluster with workpools for dynamic
load balancing within a cluster. It is capable of rendering
objects composed of polygons and spheres. 

Fig. 9 shows three scenes that were rendered sequentially
and with the parallel software on the DN 10000.
The teapot and the car are B-spline surface objects that 
have been tessellated into polygons. The teapot contains 
3600 polygons, the car contains 46,000 polygons, and the 
sphere flake contains 7300 spheres. Table I gives the 
rendering times in seconds for a screen of 500 by 500 
pixels. Each scene experienced at least a threefold 
speed improvement using four processors. 





Table I
Results
(500 by 500 pixels)

               1 Processor   4 Processors   Improvement
Teapot            422 s          130 s          3.2
Car               879 s          288 s          3.0
Sphereflake      1458 s          392 s          3.7
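
The improvement column is the ratio of the one-processor time to the four-processor time. The Python sketch below recomputes it from the times as printed and adds the parallel efficiency (speedup divided by the number of processors), a derived quantity that does not appear in the article.

```python
def speedup(t_one, t_many):
    """Ratio of sequential time to parallel time."""
    return t_one / t_many

times = {"Teapot": (422, 130), "Car": (879, 288), "Sphereflake": (1458, 392)}
for scene, (t1, t4) in times.items():
    s = speedup(t1, t4)
    print(f"{scene}: speedup {s:.2f}, efficiency {s / 4:.0%}")
```

The recomputed ratios are roughly 3.25, 3.05, and 3.72, in line with the table's one-decimal figures and with the statement that each scene ran at least three times faster on four processors.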



Conclusions and Future Work 

An overview of the raytracing algorithm and a discussion 
of a parallel implementation of raytracing for the ICE 
architecture have been presented. A first version of the 
parallel software is running on an Apollo DN 10000 
yielding a threefold improvement in speed over the 
sequential software. The DN 10000 simulations provide a
vehicle for parallel code development and statistical
gathering of scene renderings. A version of the multicluster
software is being written on the DN 10000 to develop
code for simulation of message passing and load balancing.
We will have a version of the code to port directly to
the ICE architecture when the hardware is finished. 

ICE will provide a platform for parallel algorithm development
and experimentation for a variety of graphics
applications. Raytracing characteristics, such as grid size,
ray tree depth, ray type distribution (shadow, reflection,
refraction), and required interprocessor communication
bandwidths, are scene dependent, making any sort of
theoretical analysis difficult. The goal of our future work
is an extremely fast implementation of raytracing capable
of handling very large data sets. At the same time, we
would like to develop an understanding of how best to
distribute data and perform load balancing on the ICE
architecture.